SMP 2.2.14/15pre10: fork() fails with -ENOMEM during alloc_task_struct

From: kahuna@panix.com
Date: Mon Feb 28 2000 - 17:22:47 EST


Greetings,

While running each of the kernels 2.2.13, 2.2.14, 2.2.15pre9, and
2.2.15pre10, I have encountered conditions on an SMP 4-CPU machine
where the fork() system call fails to allocate a task_alloc_struct
and returns ENOMEM to the caller.

At the time of the failed system call, there is plenty of memory
(128meg of 1 gig), and certainly another gigabyte of swap (300k
used) at the time of the no memory error.

Other things I have checked are:
* there are far less than 512 processes running (NR_TASK = 4090);
* there is plenty of free memory immediately following the failed system call;
* there doesn't appear to be inordinate memory leaks in the kernel itself
- but how to tell?

I attempted to place a spinlock around the old code that is in
arch/i386/kernel/process.c (which is deactivated in SMP mode
on the assumption that get_free_page(GFP_KERNEL) will return
the correct value) which maintains a small stack of 16 previously
allocated task_alloc_struct's, but thought of certain deadlock
SMP race conditions on the way home that evening, and sure enough -
the machine locked solid a few hours later.

Things I will be trying in the meantime:
* change the application to try again to fork "a little later" "several" times;
* change the application to not abort on a failed fork;
* change the application's parent to not abort when its co-routine aborts;
* change the application's parent to not need the forking process in the
   first place.

These are all work-arounds for what should be a non-error in the first place.

So my questions are:

1. under what conditions would get_free_page(GFP_KERNEL) return
NULL (no available page) when it appears there is free memory?

2. other than watching the output of Shift-ScrollLock periodically,
what other ways are there to spot memory usage in the kernel?

3. 'ps -vax' shows interesting behaviour, but the sum-total of
each of RSS and allocated memory are less than physical memory,
and 'swapon -l' shows very little allocated swap. It doesn't
appear that I am trigger the OOM kill-process behaviour of
the kernel from being overcommitted. How can I confirm this?

4. Is anyone else interested in tackling this sort of problem in the
developer community? The application is a massively multi-player
game. :)

Best regards,
Andy

PS - I read the list via muc.lists.linux-kernel, but emailed responses
will also be watched - assuming I have procmail configured correctly.
:) -A

-- 
Andrew Finkenstadt (http://www.finkenstadt.com/andy/)

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Tue Feb 29 2000 - 21:00:20 EST