Hanging problems in 2.0.30 -- discoveries

Philip Gladstone (philip@raptor.com)
Fri, 01 Aug 1997 21:19:11 -0400


I can persuade my system to hang reproducibly under 2.0.30-pre2
under heavy load. It turns out that it doesn't really
hang; instead, a wait_queue becomes corrupted and the kernel
goes into an infinite loop trying to take something
off the queue. [I added checks at add time to ensure that
the wait queue is acceptably short (fewer than 1000 entries).]
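
As a rough illustration of that kind of check, here is a
user-space sketch. It is not the kernel's wait_queue code: the
struct layout, the circular list, and the names
queue_length_bounded and add_waiter_checked are all assumptions
made for the example. Only the 1000-entry bound comes from the
text above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 1000 /* bound taken from the post; the rest is illustrative */

/* Hypothetical stand-in for a wait-queue entry on a circular list. */
struct waiter {
    int id;
    struct waiter *next;
};

/*
 * Walk the circular list starting at head. Returns the number of
 * entries if the walk gets back to head in fewer than `limit` steps,
 * or -1 if it does not (queue too long, broken link, or a cycle
 * that never returns to head).
 */
int queue_length_bounded(const struct waiter *head, int limit)
{
    const struct waiter *p = head;
    int n = 0;

    if (head == NULL)
        return 0;
    do {
        if (p == NULL)
            return -1;          /* broken link */
        if (++n >= limit)
            return -1;          /* too long, or looping forever */
        p = p->next;
    } while (p != head);
    return n;
}

/*
 * Add w to the queue, but only if the queue passes the sanity check
 * first. Returns 0 on success, -1 (queue untouched) on failure.
 */
int add_waiter_checked(struct waiter **head, struct waiter *w)
{
    assert(head != NULL && w != NULL);
    if (queue_length_bounded(*head, MAX_WAITERS) < 0)
        return -1;
    if (*head == NULL) {
        w->next = w;            /* single entry points at itself */
        *head = w;
    } else {
        w->next = (*head)->next;
        (*head)->next = w;
    }
    return 0;
}
```

The point of bounding the walk is that a corrupted queue is
reported at add time instead of turning into an endless loop
later, when something is taken off the queue.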

These checks get triggered repeatedly. It turns out that when
this happens, memory starvation has kicked in, and the kernel is
trying to handle a page fault. However, the current->mm
pointer is not pointing at anything useful, so the
'down' operation is doomed.

My environment uses clone calls -- so there is no real reason
why this process shouldn't have a valid mm structure. I notice
that the reference count increment doesn't happen
until quite late in the cloning operation. In particular,
it looks as though the kernel stack page pointer (which
is allocated in the cloning process) overlaps with what used to be
the parent's mm pointer. I suspect that some of the cloning
memory allocations are blocking, but I don't know.
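
To make the suspected window concrete, here is a small
user-space model. It is only a sketch of the general hazard
(sharing a pointer, sleeping, and counting the reference
afterwards), not a description of what the 2.0.30 clone path
actually does. struct mm_model, the sleep hook, and
holder_exits are all hypothetical; whether anything can really
drop the last reference during the clone is exactly the open
question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an mm structure: a reference count
 * plus a flag recording that the structure has been released. */
struct mm_model {
    int count;
    int freed;
};

/* Drop one reference; release the structure when none remain. */
void mm_put(struct mm_model *mm)
{
    assert(mm->count > 0);
    if (--mm->count == 0)
        mm->freed = 1;
}

/* Runs at the point where a blocking allocation would let
 * other tasks run. */
typedef void (*sleep_hook)(struct mm_model *mm);

/* Late ordering: share the pointer, sleep, then take the reference.
 * Returns 0 on success, -1 if the mm was released during the sleep. */
int share_mm_late(struct mm_model *mm, sleep_hook while_asleep)
{
    struct mm_model *child_mm = mm;   /* pointer shared, not yet counted */

    if (while_asleep)
        while_asleep(mm);
    if (child_mm->freed)
        return -1;                    /* child_mm is now stale */
    child_mm->count++;
    return 0;
}

/* Early ordering: take the reference before anything can sleep. */
int share_mm_early(struct mm_model *mm, sleep_hook while_asleep)
{
    mm->count++;
    if (while_asleep)
        while_asleep(mm);
    return mm->freed ? -1 : 0;
}

/* Hypothetical event during the sleep: the existing holder
 * drops its reference. */
void holder_exits(struct mm_model *mm)
{
    mm_put(mm);
}
```

Starting from a count of 1, share_mm_late with holder_exits as
the hook ends up holding a pointer to a released structure,
while share_mm_early keeps it alive with a count of 1. Once a
real structure has been released, its memory can be reused for
something else, such as a stack page.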

Something bad is going on.... Does anybody have any ideas?

Philip

P.S. The parent creates the child with clone.

-- 
Philip Gladstone                           +1 617 487 7700
Raptor Systems, Waltham, MA         http://www.raptor.com/