Re: Avoiding OOM on overcommit...?

From: James Sutherland (jas88@cam.ac.uk)
Date: Fri Mar 24 2000 - 19:18:35 EST


On Fri, 24 Mar 2000 12:24:05 -0600 (CST), you wrote:
>James Sutherland <jas88@cam.ac.uk>:
>> On Tue, 21 Mar 2000 13:20:05 -0600 (CST), you wrote:
>> >James Sutherland <jas88@cam.ac.uk>:
>> >> On Mon, 20 Mar 2000 20:50:08 -0600, you wrote:
>> >> >On Mon, 20 Mar 2000, Horst von Brand wrote:
>> >> >>Jesse Pollard <pollard@tomcat.admin.navo.hpc.mil> said:
(snip)
>> >Cray YMP-8 - reached 100% several years ago. Replaced by C90, C90 replaced
>> >by dual SV1 system. SV1 system at 80%, expanding to 4 nodes.
>> >SGI Power Challenge array - reached 90% usage in 1 year. Replaced by Origin
>> >2000. Origin 2000 (128 node) updated when load reached 80% (with several
>> >OOM crashes due to management's choice to overcommit)
>>
>> What do you mean by "overcommit" in this case - the total online user
>> quota exceeding total VM, or the Linux (memory allocated on demand)
>> sense?
>
>Here overcommit is the total concurrent user quota exceeding total VM. When
>the users actually use the memory allocated (memory allocated on demand), the
>systems can fail/hang/crash/abort random processes.

That's a totally different meaning of the word, not the one everyone
else here is using - and if a simple failed userspace memory
allocation causes your system to crash, get a refund. It's terminally
broken.

>> And are you seriously telling me that if I take an Irix box, disable
>> user VM quotas and do a malloc() loop, the system crashes?! (Mental
>> note: Replace all Irix boxes with NT)
>
>note: malloc and use it, then it can. (NT can't do the work - too slow,
>insufficient number of CPUs, lack of multi-user identification, insufficient
>disk (although that is easing).)

Oh dear. If you are right, SGI should try drug-testing their OS
programmers.
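
For the record, the sort of thing we are both describing is just an
allocate-and-touch loop along these lines - a rough sketch, not a test of
anyone's box in particular. On a sane system it should end with a failed
malloc() or a killed process, never a dead machine:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (1024 * 1024)   /* grab memory 1MB at a time */

int main(void)
{
    unsigned long total = 0;
    char *p;

    for (;;) {
        p = malloc(CHUNK);
        if (p == NULL) {
            /* allocation refused - the polite failure mode */
            printf("malloc failed after %lu MB\n", total);
            return 1;
        }
        /* touching every page is the "use it" part - this is what
           actually consumes memory rather than just address space */
        memset(p, 0xAA, CHUNK);
        total++;
    }
}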

(snip)
>> >Of course it is. If too many users are on one system trying to do too much
>> >at one time the system fails.
>>
>> "The system fails"?? Either you have one broken system, or you have a
>> strange definition of "failure". The users should just get errors that
>> they don't have the resources available.
>
>The users sometimes get process aborts. The problem on a 128 cpu system is
>that it can fill faster than it can abort. The IRIX systems, for example,
>do not have resource control as strong as UNICOS's.

Oh dear. Essentially, then, a race condition. We have a pair of
similar (but rather smaller) O2400s here; I don't remember this
problem being mentioned. OTOH, they effectively run single-user batch
jobs only. It's the Hitachi vector machine that causes true grief...

Anyway, that's just a bad security (DoS) hole in the OS, not a design
flaw in the memory management system.

(snip)
>> Crashing the system should not be possible under any circumstances,
>> regardless of resource quotas. If a user-mode program can crash the
>> system using only unprivileged syscalls, you have found a nasty bug.
>
>Not really - even root uses unprivileged syscalls. When the system resources
>are exhausted, the system fails. Sometimes gracefully, sometimes not.

My point is, usermode problems should NEVER cause the system to crash
or otherwise go down. Obviously, privileged syscalls can do (if
"shutdown -h now" doesn't cause downtime, I want to know why!)

>> >> If users are unable to perform legitimate tasks due to a "lack of
>> >> resources", when the machine has ample resources to spare, I would
>> >> consider the system down. I would then fire the sysadmin who
>> >> configured it that inefficiently.
>> >
>> >That's just it - the system was saturated.
>>
>> Your quota settings would cause the system to begin denying resource
>> allocation BEFORE the system is saturated.
>
>Yes. The goal is to do that within some percentage of the total. On a 1GB
>memory system I would like the user resource limits to keep the last 10-20
>pages free. On the target systems I'm looking at (where it would be most
>useful) there is > 1GB memory with more than 1 CPU (2 - 32) AND more than 10
>users. This puts the burden of maximizing system usage on both schedulers
>(interactive and batch). The batch scheduler has a much easier time since it
>can pick and choose jobs to run from start to finish. The interactive
>scheduler can only take first come, first served. To compare a batch
>scheduler to an interactive scheduler it may be more proper to compare a
>batch scheduler with a combined inetd/telnetd/login group. I'm not pointing
>at the kernel process scheduler, whose job is to maximize cpu utilization.

My point is, your quota settings are far too restrictive for a typical
multi-user system. Many of our users will be logged in for days on
end, just running Pine - or, frequently, just keeping a shell window
or two open. Do you keep their entire VM quota reserved for them all
that time?
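
To be clear about what I mean by a quota here: the ordinary setrlimit()
sort of cap, which is only checked when a process actually tries to grow,
not a reservation held while a shell sits idle. A minimal sketch (the
128MB figure and the choice of RLIMIT_DATA are just for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    char *p;

    /* Cap this process's data segment at 128MB.  Nothing is
       reserved up front; the limit only bites when the process
       actually tries to grow past it. */
    rl.rlim_cur = rl.rlim_max = 128UL * 1024 * 1024;
    if (setrlimit(RLIMIT_DATA, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    /* A small allocation well under the limit; an idle process
       sitting here costs the machine little beyond what it has
       really touched. */
    p = malloc(16 * 1024 * 1024);
    if (p == NULL) {
        perror("malloc");
        return 1;
    }
    memset(p, 0, 16 * 1024 * 1024);
    free(p);
    return 0;
}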

>There are process schedulers that also schedule based on memory utilization.
>Linux doesn't have that either; I'm also not fully convinced it needs
>one yet, although they can help utilize memory better (IRIX seems
>to do this, but I don't think it does a good job).

I haven't used Irix for a while - some sort of programming course a
few years ago on Indys, IIRC. I wasn't very impressed then, TBH.

For batch jobs, certainly, being able to take into account multiple
factors would be a big asset - specify typical and maximum figures for
runtime required, memory usage and working set, interconnect bandwidth
etc. Far too much overhead for an interactive general purpose box, of
course, but ideal for things like our Origins...
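
Something along the lines of this sketch is what I have in mind - purely
illustrative, not any real queueing system's interface - just the per-job
figures such a scheduler could weigh against what is actually free:

/* Illustrative only - not any real batch system's job format. */
struct job_request {
    long typical_runtime_s;     /* expected run time, seconds       */
    long max_runtime_s;         /* hard limit before the job is cut */
    long typical_mem_kb;        /* expected total memory usage      */
    long max_mem_kb;            /* hard memory limit                */
    long working_set_kb;        /* resident set it really needs     */
    long interconnect_kb_s;     /* expected interconnect bandwidth  */
};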

>> >> However, if you REALLY insist on setting the system up so people can
>> >> login and display all sorts of nice OOM error messages on an idle
>> >> system, go ahead. I'm not stopping you.
>> >
>> >If it were idle there would be no OOM error messages :)
>>
>> Yes there would - you run out of YOUR memory, even when the system has
>> plenty to spare.
>
>An IDLE system has free CPU time as well as free memory available. Now if I
>am the only user on the system your statement is correct. The system being
>targeted will tend not to be idle - there may be available CPU and memory -
>but I'm not the only user. I would expect to be one out of 5-20, where most
>of the other users may be batch jobs.

Batch jobs, of course, are a very different kettle of fish; perhaps I
am biased here. The Linux boxes here are all general purpose
multi-user machines (e-mail, compilation, etc.), plus some big Unix
boxes (two Origin 2400s, a Hitachi vector machine, and a few others).
There is a pair of Beowulf clusters, too, but that's not used in the
same way.

Obviously, if a user is only "active" when running batch jobs of
roughly known resource usage, you can tailor resource limits to get
pretty efficient usage. This isn't the kind of area Linux is usually
used in, though.

>> >> >>> If you overcommit:
>> >> >>> You might support 100 simultaneous connections, but not full saturation
>> >> >>> of each server, without aborting some connections or crashing the
>> >> >>> server/system.
>> >> >>>
>> >> >>> Result: some unhappy customers, not so happy management, difficulty
>> >> >>> in identifying problems, and definitely no raise.
>> >> >>
>> >> >>Other customers are unhappy because of long download times, NS (or
>> >> >>whatever) crashing, not getting to the site because of (local or remote)
>> >> >>network congestion. The "catastrophe" isn't even a blip on the sysadmin's
>> >> >>screen.
>> >> >
>> >> >Yup - long download times since the system keeps crashing, forcing the
>> >> >customer to a different site and or different company.
>> >>
>> >> No, the system never gets as far as crashing, because the inept
>> >> sysadmin has already disabled it.
>> >
>> >not worth answering.
>>
>> Your approach, with a typical workload, would waste a large percentage
>> of the system's resources. Users rarely max out their quotas - but you
>> set the system to ensure that a user with a 128Mb quota has 128Mb
>> exclusively reserved for their use alone anyway?
>
>Where I work, users frequently max out their quotas, especially when they
>are interactive. The majority of the jobs are not interactive (lots of
>VERY small jobs copying data, some edits, and batch monitoring). Simulations
>take a long time to set up and take a lot of memory/cpu, but have small
>interactive needs.
>
>I exceed my quota almost every time I attempt to view syslog with vi, unless
>it has just been cycled. I try anyway, since it is very convenient when it
>works.

In other words, average resource usage as a % of quota is very high?
That would be the exception, rather than the rule, I think; your quota
settings would simply not be practical on most of our systems.

>> >> >The catastrophe is loss of business, loss of reputation, going out of
>> >> >business because you can't keep the customers.
>> >>
>> >> The customers have already gone somewhere they can actually run
>> >> software without OOM errors.
>> >
>> >not your site - it keeps crashing.
>>
>> Why? I have the same level of resources as you, so the same
>> workload will succeed without any problems.
>
>No it won't - we force our users to schedule access to run large-memory
>jobs. You don't. If two of our users ran their jobs simultaneously, most
>likely both would abort.

No - only one. The same one your system would have allowed to run in
the first place. The second user must then wait until enough resources
are available.

> On our Origin, we have a total memory of 64GB.
>We allow our users to run a process of up to 32GB. Our batch system
>limits the total number of (big) processes to 1. Our interactive users
>are limited to 15 minutes of cpu time. When it becomes necessary
>we will limit the total number of interactive users (it's already limited,
>but much higher than I like -- if they all logged in at once then the
>system could crash - management choice, definitely not mine). Due to the
>way the system IS overcommitted (memory and interactive) I have seen
>process failures (one crash; inetd, cron, sendmail ... all have failed
>at various times). All of these failures have been attributed to
>oversubscription of resources.

The crash is certainly a problem - what did SGI say? inetd, cron etc.
terminating shouldn't be a big issue, though; they could just respawn
from init.
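
For the daemons, the usual trick is a respawn entry in /etc/inittab, the
same mechanism the getty lines already use - roughly like this (the getty
binary differs between distributions, and the last, commented-out line is
hypothetical; the flag to keep a daemon in the foreground varies, so check
its manpage):

# /etc/inittab entries use the format id:runlevels:action:process.
# The getty lines are the classic "respawn" example - init restarts
# the process whenever it exits:
1:2345:respawn:/sbin/mingetty tty1
2:2345:respawn:/sbin/mingetty tty2
#
# Any daemon can be kept alive the same way, provided it is told to
# stay in the foreground rather than detach:
#dm:2345:respawn:/usr/local/sbin/somedaemon --no-detach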

It looks like this bug you've found in Irix is causing serious
problems; Rik's OOM killer would have helped here, properly managed
(with the policy support we've been discussing). Lowering resource
limits would hide the problem, but you really need to fix it rather
than just avoid triggering it...

>Special arrangements can be made to run a process of up to 60GB (we shut down
>all batch processing and run a special queue with only one job at a time).
>The user submits his job at any time, it just won't run because the resources
>are not available.
>
>> You still don't appear to understand what overcommit means in this
>> context.
>
>"overcommit": The sum of all concurrent process virtual memory ranges (in
> pages) exceeds the number of pages in main memory plus swap
> memory.
>
>virtual memory range: That addressable memory requested by a process and
> allocated to the process by the kernel via sbrk, fork, exec,
> and stack. (mmap should be included too, but I don't know
> for sure yet.)

The issue being referred to as "overcommit" in Linux is the fact that
Linux does not count unused malloc() address space against that total,
since there is nothing in that space yet. If your system is forced to
be overcommitted by your definition, you DO have problems.
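
A sketch of the distinction (illustrative only - where things actually
fail depends entirely on the box): under the Linux behaviour the
allocation below will normally succeed even if the size requested is more
than RAM plus swap, because only pages that get written to are ever
committed.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Ask for 1GB of address space in one go. */
    size_t len = (size_t)1024 * 1024 * 1024;
    char *p = malloc(len);

    if (p == NULL) {
        printf("malloc refused %lu bytes\n", (unsigned long)len);
        return 1;
    }

    /* With overcommit in the Linux sense we can get here even on a
       box with far less than 1GB of RAM + swap: the address space
       has been handed out, but no pages are committed until they
       are written to (which is exactly what the allocate-and-touch
       loop earlier in this mail does). */
    printf("malloc succeeded - nothing committed yet\n");

    free(p);
    return 0;
}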

>There is still some memory not accounted for here - memory buffers for
>I/O, kernel addressing requirements, TCP, driver pools, etc.
>
>The primary usage of memory is covered.

This is not the usage in this thread.

James.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/


