Re: Avoiding OOM on overcommit...?

From: Jesse Pollard (pollard@cats-chateau.net)
Date: Sat Mar 25 2000 - 11:35:26 EST


On Fri, 24 Mar 2000, James Sutherland wrote:
>On Fri, 24 Mar 2000 12:24:05 -0600 (CST), you wrote:
>>James Sutherland <jas88@cam.ac.uk>:
>>> On Tue, 21 Mar 2000 13:20:05 -0600 (CST), you wrote:
>>> >James Sutherland <jas88@cam.ac.uk>:
>>> >> On Mon, 20 Mar 2000 20:50:08 -0600, you wrote:
>>> >> >On Mon, 20 Mar 2000, Horst von Brand wrote:
>>> >> >>Jesse Pollard <pollard@tomcat.admin.navo.hpc.mil> said:
>(snip)
>>> >Cray YMP-8 - reached 100% several years ago. Replaced by C90, C90 replaced
>>> >by dual SV1 system. SV1 system at 80%, expanding to 4 nodes.
>>> >SGI Power Challenge array - reached 90% usage in 1 year. Replaced by Origin
>>> >2000. Origin 2000 (128 node) updated when load reached 80% (with several
>>> >OOM crashes due to management's choice to overcommit)
>>>
>>> What do you mean by "overcommit" in this case - the total online user
>>> quota exceeding total VM, or the Linux (memory allocated on demand)
>>> sense?
>>
>>Here overcommit is the total concurrent user quota exceeding total VM. When the
>>users actually use the memory allocated (memory allocated on demand) the
>>systems can fail/hang/crash/abort random processes.
>
>That's a totally different meaning of the word, not the one everyone
>else here is using - and if a simple failed userspace memory
>allocation causes your system to crash, get a refund. It's terminally
>broken.

Don't change topics. That is what overcommit in this case is. It can
crash systems BECAUSE THE SYSTEM WAS TOLD TO ALLOCATE MORE MEMORY THAN WAS
AVAILABLE.
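
To make that concrete, a minimal sketch (the chunk size is only
illustrative, and exactly where it dies depends on the overcommit policy):

#include <stdlib.h>
#include <string.h>

int main(void)
{
        size_t chunk = 64UL * 1024 * 1024;      /* 64MB per request */

        for (;;) {
                char *p = malloc(chunk);
                if (p == NULL)                  /* a strictly accounted kernel
                                                   says "no" here */
                        return 1;
                memset(p, 0, chunk);            /* an overcommitting kernel runs
                                                   out here instead, or kills
                                                   someone else entirely */
        }
}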

>>> And are you seriously telling me that if I take an Irix box, disable
>>> user VM quotas and do a malloc() loop, the system crashes?! (Mental
>>> note: Replace all Irix boxes with NT)
>>
>>note: malloc and use it, then it can. (NT can't do the work - too slow,
>>insufficient number of CPUs, lack of multi-user identification, insufficient
>>disk (although that is easing).)
>
>Oh dear. If you are right, SGI should try drug-testing their OS
>programmers.

Besides doing so anyway... Any overcommitted resource causes system problems,
some more than others.

>(snip)
>>> >Of course it is. If too many users are on one system trying to do too much
>>> >at one time the system fails.
>>>
>>> "The system fails"?? Either you have one broken system, or you have a
>>> strange definition of "failure". The users should just get errors that
>>> they don't have the resources available.
>>
>>The users sometimes get process aborts. The problem on a 128-CPU system is
>>that it can fill faster than it can abort. The IRIX systems, for example,
>>do not have as strong resource control as UNICOS does.
>
>Oh dear. Essentially, then, a race condition. We have a pair of
>similar (but rather smaller) O2400s here; I don't remember this
>problem being mentioned. OTOH, they are effectively single-user batch
>jobs only. It's the Hitachi vector machine that causes true grief...
>
>Anyway, that's just a bad security (DOS) hole in the OS. Not a design
>flaw in the memory management system.

No - that is a fault of management authorizing the overcommitting of resources.
This is the same problem Linux has, but without the management authorization.

>(snip)
>>> Crashing the system should not be possible under any circumstances,
>>> regardless of resource quotas. If a user-mode program can crash the
>>> system using only unprivileged syscalls, you have found a nasty bug.
>>
>>Not really - even root uses unprivileged syscalls. When the system resources
>>are exhausted, the system fails. Sometimes gracefully, sometimes not.
>
>My point is, usermode problems should NEVER cause the system to crash
>or otherwise go down. Obviously, privileged syscalls can do (if
>"shutdown -h now" doesn't cause downtime, I want to know why!)

User mode problems can always cause the system to crash if resources are
overcommitted (memory in particular) - either indirectly, by the system going
into a deadlock hang, or directly, by having init fail.

>>> >> If users are unable to perform legitimate tasks due to a "lack of
>>> >> resources", when the machine has ample resources to spare, I would
>>> >> consider the system down. I would then fire the sysadmin who
>>> >> configured it that inefficiently.
>>> >
>>> >Thats just it - the system was saturated.
>>>
>>> Your quota settings would cause the system to begin denying resource
>>> allocation BEFORE the system is saturated.
>>
>>Yes. The goal is to do that within some percentage of the total. On a 1GB
>>memory system I would like the user resource limits to prevent use of the
>>last 10-20 pages. The target systems I'm looking at (where it would be most
>>useful) have > 1GB memory with more than 1 CPU (2 - 32) AND more than 10
>>users. This puts the burden of maximizing system usage on both schedulers
>>(interactive and batch). The batch scheduler has a much easier time since it
>>can pick and choose jobs to run from start to finish. The interactive
>>scheduler can only take first come, first served. To compare a batch
>>scheduler to an interactive scheduler it may be more proper to compare a
>>batch scheduler with a combined inetd/telnetd/login group. I'm not pointing
>>at the kernel process scheduler, whose job is to maximize CPU utilization.
>
>My point is, your quota settings are far too restrictive for a typical
>multi-user system. Many of our users will be logged in for days on
>end, just running Pine - or, frequently, just keeping a shell window
>or two open. Do you keep their entire VM quota reserved for them all
>that time?

Under UNICOS - yes.
Under IRIX - no, management has directed that overcommitting memory resources
be permitted.
Under Linux - no, no choice.
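
For what it's worth, the reserve-it-now versus bill-it-later distinction is
visible at the mmap() level as well. A minimal sketch, using the Linux
spellings of the flags (other systems name the reservation control
differently, or make one of the behaviours the default):

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map 'len' anonymous bytes, either with normal
 * accounting or with MAP_NORESERVE (no backing reserved at all).  On a
 * strictly accounted system the first form fails up front when swap runs
 * out; on an overcommitting one, both succeed and the failure is deferred
 * to the first touch. */
static void *map_anon(size_t len, int noreserve)
{
        int flags = MAP_PRIVATE | MAP_ANONYMOUS;
        void *p;

        if (noreserve)
                flags |= MAP_NORESERVE;

        p = mmap(0, len, PROT_READ | PROT_WRITE, flags, -1, 0);
        return p == MAP_FAILED ? NULL : p;
}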

>>There are process schedulers that also schedule based on memory utilization.
>>Linux doesn't have that either; I'm also not fully convinced it needs one
>>yet, although they can help utilize memory better (IRIX seems to do this,
>>but I don't think it does a good job).
>
>I haven't used Irix for a while - some sort of programming course a
>few years ago on Indys, IIRC. I wasn't very impressed then, TBH.

The courses never covered the larger systems, unfortunately. I walked into
the "advanced system administration" course and asked a simple question: "How
do I add 1500 user accounts using the IRIX user administration tools?" After
a 2-3 minute pause the answer was "Not that way".

>For batch jobs, certainly, being able to take into account multiple
>factors would be a big asset - specify typical and maximum figures for
>runtime required, memory usage and working set, interconnect bandwidth
>etc. Far too much overhead for an interactive general purpose box, of
>course, but ideal for things like our Origins...

Yes - now if IRIX would fully implement the API to get to the kernel to set
the limits... (they are there, but not available for batch systems to use -
it has to be done via systune, which is not exactly dynamic).
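
The per-process half of that is just the standard setrlimit() interface; a
minimal sketch of what a batch system could do before exec'ing a job (the
helper name and the limit values are only illustrative):

#include <sys/time.h>
#include <sys/resource.h>

/* Hypothetical helper a batch system might call just before exec'ing a
 * job: cap the job's CPU time and data segment so a runaway job hits its
 * own limit instead of taking the rest of the machine with it. */
static int cap_job(void)
{
        struct rlimit cpu  = { 15 * 60, 15 * 60 };          /* 15 CPU minutes */
        struct rlimit data = { 32UL << 20, 32UL << 20 };    /* 32MB data seg  */

        if (setrlimit(RLIMIT_CPU, &cpu) != 0)
                return -1;
        if (setrlimit(RLIMIT_DATA, &data) != 0)
                return -1;
        return 0;
}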

>>> >> However, if you REALLY insist on setting the system up so people can
>>> >> login and display all sorts of nice OOM error messages on an idle
>>> >> system, go ahead. I'm not stopping you.
>>> >
>>> >If it were idle there would be no OOM error messages :)
>>>
>>> Yes there would - you run out of YOUR memory, even when the system has
>>> plenty to spare.
>>
>>An IDLE system has free CPU time as well as free memory available. Now if I
>>am the only user on the system your statement is correct. The system being
>>targeted will tend not to be idle - there may be available CPU and memory -
>>but I'm not the only user. I would expect to be one out of 5-20, where most
>>of the other users may be batch jobs.
>
>Batch jobs, of course, are a very different kettle of fish; perhaps I
>am biased here. The Linux boxes here are all general purpose
>multi-user machines (e-mail, compilation, etc.), with a couple of big
>Unix boxes (two Origin 2400s, a Hitachi vector machine, a few others.)
>There is a pair of Beowulf clusters, too, but that's not used in the
>same way.
>
>Obviously, if a user is only "active" when running batch jobs of
>roughly known resource usage, you can tailor resource limits to get
>pretty efficient usage. This isn't the kind of area Linux is usually
>used in, though.

Not yet. Resource limits would allow such use in a Beowulf cluster, though.
They seem to look more like IBM SP systems (I haven't yet used either; our SP
doesn't arrive until sometime this summer).

>>> >> >>> If you overcommit:
>>> >> >>> You might support 100 simultaneous connections, but not full saturation
>>> >> >>> of each server, without aborting some connections or crashing the
>>> >> >>> server/system.
>>> >> >>>
>>> >> >>> Result: some unhappy customers, not so happy management, difficulty
>>> >> >>> in identifying problems, and definitely no raise.
>>> >> >>
>>> >> >>Other customers are unhappy because of long download times, NS (or
>>> >> >>whatever) crashing, not getting to the site because of (local or remote)
>>> >> >>network congestion. The "catastrophe" isn't even a blip on the sysadmin's
>>> >> >>screen.
>>> >> >
>>> >> >Yup - long download times since the system keeps crashing, forcing the
>>> >> >customer to a different site and or different company.
>>> >>
>>> >> No, the system never gets as far as crashing, because the inept
>>> >> sysadmin has already disabled it.
>>> >
>>> >not worth answering.
>>>
>>> Your approach, with a typical workload, would waste a large percentage
>>> of the system's resources. Users rarely max out their quotas - but you
>>> set the system to ensure that a user with a 128Mb quota has 128Mb
>>> exclusively reserved for their use alone anyway?
>>
>>Where I work, users frequently max out their quotas, especially when they
>>are interactive. The majority of the jobs are not interactive (lots of
>>VERY small jobs copying data, some edits, and batch monitoring). Simulations
>>take a long time to set up; they take a lot of memory/CPU, but have small
>>interactive needs.
>>
>>I exceed my quota almost every time I attempt to view syslog with vi, unless
>>it has just been cycled. I try anyway, since it is very convenient when it
>>works.
>
>In other words, average resource usage as a % of quota is very high?
>That would be the exception, rather than the rule, I think; your quota
>settings would simply not be practical on most of our systems.

Many systems just don't need it.

>>> >> >The catastrophe is loss of business, loss of reputation, going out of
>>> >> >business because you can't keep the customers.
>>> >>
>>> >> The customers have already gone somewhere they can actually run
>>> >> software without OOM errors.
>>> >
>>> >not your site - it keeps crashing.
>>>
>>> Why? If I have the same level of resources as you, so the same
>>> workload will succeed without any problems.
>>
>>No it won't - we force our users to schedule access to run large memory
>>jobs. You don't. If two of our users ran their jobs simultaneously, most
>>likely both would abort.
>
>No - only one. The same one your system would have allowed to run in
>the first place. The second user must then wait until enough resources
>are available.

Several of your users submit large jobs; they just run serially instead of
in parallel.

>> On our Origin, we have a total memory of 64GB.
>>We allow our users to run a process of up to 32GB. Our batch system
>>limits the total number of (big) processes to 1. Our interactive users
>>are limited to 15 minutes of CPU time. When it becomes necessary
>>we will limit the total number of interactive users (it's already limited,
>>but much higher than I like -- if they all logged in at once then the
>>system could crash - management's choice, definitely not mine). Due to the
>>way the system IS overcommitted (memory and interactive) I have seen
>>process failures (one crash; inetd, cron, sendmail ... all have failed
>>at various times). All of these failures have been attributed to
>>oversubscription of resources.
>
>The crash is certainly a problem - what did SGI say? inetd, cron etc.
>terminating shouldn't be a big issue, though, they could just respawn
>from init.

I think the crash simply recorded "init terminated". No bus error,
no error messages, just a reboot.

Unfortunately, the structure SGI uses to start these processes doesn't
allow init to restart them. The "official" restart is to reboot, or to
reinitialize the network. Getting that to work reliably is another matter:
it usually works to just run "/etc/init.d/network start", but some daemons
(the RPC ones?) don't always get the message, and then the only way to fix
it is "network stop; network start". IRIX really has some bugs in init too -
a "telinit q" or a kill -1 of init's pid doesn't always re-read the config
file correctly. (The same goes for inetd... which is worse, since it is done
more often.)

>It looks like this bug you've found in Irix is causing serious
>problems; Rik's OOM killer would have helped here, properly managed
>(with the policy support we've been discussing). Lowering resource
>limits would hide the problem, but you really need to solve the
>problem, rather than just avoiding triggering it...

It really wouldn't happen if we didn't overcommit memory, or if there were
sufficient swap space (or at least more of it). The SGI analyst did get
convinced to add another 12GB of swap space, which reduced the failure rate
to something more acceptable.

>>Special arrangements can be made to run a process of up to 60GB (we shut
>>down all batch processing and run a special queue with only one job at a
>>time). The user submits his job at any time; it just won't run because the
>>resources are not available.
>>
>>> You still don't appear to understand what overcommit means in this
>>> context.
>>
>>"overcommit": The sum of all concurrent process virtual memory ranges (in
>> pages) exceeds the number of pages in main memory plus swap
>> memory.
>>
>>virtual memory range: That addressable memory requested by a process and
>> allocated to the process by the kernel via sbrk, fork, exec,
>> and stack. (mmap should be included too, but I don't know
>> for sure yet.)
>
>The issue being referred to as "overcommit" in Linux is the fact that Linux
>does not include unused malloc() address space in this, since there is
>nothing in that space yet. If your system is forced to be
>overcommitted by your definition, you DO have problems.

It is covered by the definition - the process has been told the memory is
available and has been given a pointer to it. Between the time the kernel
granted the resource and the time the process uses it, it may have been
taken back.

One place this occurs is in application-specific memory allocators - the
application may allocate a memory pool for its own use (see Lisp GC for an
example). The application (LISP) may not use it immediately, but does dole
it out as the user's code requires. LISP garbage collection can then abort
because it cannot condense two chunks of memory, since the second chunk may
not have been touched yet (depending on the size of the chunk, of course).
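
A rough sketch of that pattern (nothing LISP-specific about it; the names
are made up): the pool is charged against the commitment total the moment it
is created, but the pages behind it need not exist until the sub-allocator's
clients actually touch them.

#include <stdlib.h>

/* Toy pool allocator: grab one large region up front and dole it out in
 * pieces.  Under overcommit the malloc() below "succeeds" even if the
 * memory is not really there; the first pool_alloc() caller to write into
 * an untouched page is the one who pays for it. */
struct pool {
        char   *base;
        size_t  size;
        size_t  used;
};

static int pool_init(struct pool *p, size_t size)
{
        p->base = malloc(size);         /* address space granted here...  */
        p->size = size;
        p->used = 0;
        return p->base ? 0 : -1;
}

static void *pool_alloc(struct pool *p, size_t n)
{
        if (p->used + n > p->size)
                return NULL;
        p->used += n;
        return p->base + p->used - n;   /* ...but the pages behind this
                                           pointer may not exist yet      */
}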

>>There is still some memory not accounted for here - memory buffers for
>>I/O, kernel addressing requirements, TCP, driver pools, etc.
>>
>>The primary usage of memory is covered.
>
>This is not the usage in this thread.

I think it is. The kernel is out of resources because it granted user
processes access to more memory than the kernel could give.
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@cats-chateau.net

Any opinions expressed are solely my own.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/


