Gerhard Niklasch on Thu, 30 Jul 1998 15:32:08 +0200 (MET DST)


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Re: SEGV on bad recursion


In response to:
> Message-Id: <19980728150605.Z10326@io.txc.com>
> Date: Tue, 28 Jul 1998 15:06:05 -0400
> From: Igor Schein <igor@txc.com>
> 
> \\ f() is an recursively defined identity map on Z+ \ {1}
> f(n)=if(n==2,2,f(n-1)+1)
> f(1)
> 
> That's where gp segfaults and dies.

Here's what we found between us with a lot of testing, debugger-
exercising, and near-crashing a few workstations.

(1) Of course we don't expect this f(1) to return a useful result, ever.

(2) The parser dutifully descends into the recursion, allocating a little
parameter block on the PARI stack, copying the current value of n into
that, pushing the little stack of values associated to the variable n,
copying the string `if(n==2,2,f(n-1)+1)' to a chunk of malloc()ed memory,
and calling itself recursively on that.  (Not quite in this order, and
I'm skipping a number of steps.  There are actually about 8 nested calls
of C functions in the parser for each recursion level.)

(3) Thus the C stack, which is an animal entirely separate from the
PARI stack, is filling up with call frames of C functions.

(4) On SunOS/Solaris where Igor originally observed this behavior,
the C stack sits in a fixed-size (8MBy) memory segment, and fills up
this segment before the parameter blocks exhaust the default PARI stack,
and before the other allocations exhaust the available virtual memory.

(5) When this happens, a SIGSEGV is raised.  Actually, we first get
a segmentation exception from the memory management unit of the processor,
which the Solaris kernel handles on its own little interrupt stack, and
then the kernel tries to return control to gp's SIGSEGV handler in userland.

(6) However, the signal handler is now faced with an already overflowing
C stack, and cannot allocate its own stack frame, so the whole affair
dumps core.  (Actually, the kernel probably notices this before returning
control to the gp process at user level, and initiates the core dump at
once.)

The behavior varies a bit depending on the operating system.  Under Linux,
there's no hard limit at 8MBy, and the process goes on for quite a bit
longer, eating more and more virtual memory and slowing down the machine
on which it is running to a crawl unless interrupted in time.  (However,
the Solaris phenomena happen both on UltraSparcs and on Solaris PCs, so
the processor architecture plays a minor role here.)

There doesn't appear to be a reliable and portable way to catch the
problem just before we are beyond recovery.  (Repeat: we _are_ catching
SIGSEGVs, but when we're out of space on the C stack, the signal handler
cannot run.)  If someone does know of a reliable and portable way, we'd
appreciate hearing about it.

This leaves us with a bit of a dilemma.  On the one hand, it is perfectly
within the PARI philosophy to give users enough rope to hang themselves
if they are so inclined.  On the other, some users might appreciate some
sort of safety valve.  It would not be hard to add another gp default
setting  (changeable at runtime)  which would limit the maximal nesting
depth of gp user functions, say to a few thousand  (4000 ought to be safe
in all environments we have tried this on).  Then if someone desperately
needs a depth of 4002, they could change the default parameter, and get
what they want, and if someone absolutely wants to shoot him-, her- or
itself in the foot, they could set it to 0 meaning `unlimited', and
crash their own machine or at least their gp process.

Exceeding the nesting depth would _usually_ be a symptom of erroneous
(or at least ill-conceived, if not, as in Igor's case, intentionally
malicious ;^)  input to gp, anyway.

Comments?  Opinions?

Gerhard