Re: Overcommitable memory??

From: James Sutherland (jas88@cam.ac.uk)
Date: Mon Mar 13 2000 - 08:24:07 EST


On Mon, 13 Mar 2000, David Whysong wrote:

> On 13 Mar 2000, Rask Ingemann Lambertsen wrote:
> >
> >Apps would be told that the system is out of memory instead of just
> >getting a SIGKILL'ed out of the blue sky. Apps getting NULL from
> >malloc() can react appropriately, such as saving your files to disk,
> >trying again a little later or just exiting if that is acceptable for
> >what the app was doing. Apps getting SIGKILL will take your unsaved work
> >with them in the fall.
>
> Ok, so my big gravitational simulation gets NULL from malloc and
> decides to save it's work and exit. Uh-oh, time to demand-load a page of
> executable code that had been discarded, so we can save the data. Hmm, but
> we're out of memory...
>
> Even if that succeeds, or there is a foolish "no overcommit" policy, we
> need disk buffers. What if the program was told to save output to a SCSI
> device, and the kernel needs to load the driver module? We're out of
> memory! Even if we all build non-modular kernels, the kernel does some
> dynamic memory allocation.
>
> As for "trying again a little later", that leaves you with an unresponsive
> and unusable system in many cases.
>
> And please explain why my simulation -- that may have started many weeks
> (or months) ago -- should "just exit" because some random 5-minute old
> Mathematica process went and allocated half a gigabyte of memory?

Quite. Ideally, the process will get a SIGTERM first, giving it an
opportunity to save and exit safely. If this fails (if, for example, it
needs to allocate memory to do so) it will then be SIGKILLed.

This will be rather difficult to implement, but ultimately the scenario I
would like is:

App 1 fails to allocate memory - OOM handler triggered.
* Hook at this point to bring in additional swap space?
OOM handler selects several processes to terminate to free up VM.
OOM handler SIGTERMs these processes.
Some processes try to allocate more memory to handle SIGTERM, and are
stopped.
Other processes exit on SIGTERM, freeing up memory.
Remaining SIGTERMed processes now unfrozen, and may be able to allocate
memory to exit safely.
First app can now allocate memory (unless it was terminated too) and
continue as normal. If we are still hosed (if it was a kernel VM leak,
say, or some other process has now exploded) continue trying to kill
processes safely, until we have recovered or we give up. (Reboot at this
point?? Or change runlevel??)

I should point out, though, that from the very beginning of this scenario,
we are looking at a fairly hosed machine with dying processes. We may be
able to recover by bringing additional swapspace online and/or killing
suspect processes, but urgent attention is needed.

James.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Wed Mar 15 2000 - 21:00:24 EST