On Mon, 13 Mar 2000, James Sutherland wrote:
>On Mon, 13 Mar 2000, David Whysong wrote:
>> And please explain why my simulation -- that may have started many weeks
>> (or months) ago -- should "just exit" because some random 5-minute old
>> Mathematica process went and allocated half a gigabyte of memory?
>Quite. Ideally, the process will get a SIGTERM first, giving it an
>opportunity to save and exit safely. If this fails (if, for example, it
>needs to allocate memory to do so) it will then be SIGKILLed.
But that's the wrong thing to do! For example, I start a simulation that
runs for two months. Then some one starts an app that gobbles memory. My
long-running simulation malloc's 5 bytes and runs out of memory.
Why should I be penalized for somebody else's problem? SIGKILL the
offending app, not some large but otherwise well-behaved and valuable
>This will be rather difficult to implement, but ultimately the scenario I
>would like is:
>App 1 fails to allocate memory - OOM handler triggered.
>* Hook at this point to bring in additional swap space?
Pointless. This just delays the inevitable. Why not just mount all your
swap in the first place? If you have more, then you aren't really OOM.
>OOM handler selects several processes to terminate to free up VM.
Why several? You're more likely to kill "innocent" processes.
[SNIP rest of far too complicated plan to handle OOM situations...]
>I should point out, though, that from the very beginning of this scenario,
>we are looking at a fairly hosed machine with dying processes. We may be
>able to recover by bringing additional swapspace online
So you fix an OOM condition by dynamically adding memory? That seems
backwards. If you had extra swap, why not use it in the first place.
>and/or killing suspect processes, but urgent attention is needed.
The only possibilities are to (a) enforce hard memory limits with no
overcommit, thereby wasting large amounts of swap space that will never
get used while not really solving the problem, or (b) killing the
offending process(es). Rik's approach seems to be a good implementation of
(b); in fact, I am starting to believe that his general approach (and I
say nothing about his implementation here) is the ONLY reasonable way to
deal with OOM situations.
David Whysong firstname.lastname@example.org
Astrophysics graduate student University of California, Santa Barbara
My public PGP keys are on my web page - http://www.physics.ucsb.edu/~dwhysong
DSS PGP Key 0x903F5BD6 : FE78 91FE 4508 106F 7C88 1706 B792 6995 903F 5BD6
D-H PGP key 0x5DAB0F91 : BC33 0F36 FCCD E72C 441F 663A 72ED 7FB7 5DAB 0F91
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to email@example.com
Please read the FAQ at http://www.tux.org/lkml/
This archive was generated by hypermail 2b29 : Thu Mar 23 2000 - 21:00:16 EST