Re: Avoiding OOM on overcommit...?

From: Jesse Pollard (pollard@tomcat.admin.navo.hpc.mil)
Date: Fri Mar 24 2000 - 13:24:05 EST


James Sutherland <jas88@cam.ac.uk>:
> On Tue, 21 Mar 2000 13:20:05 -0600 (CST), you wrote:
> >James Sutherland <jas88@cam.ac.uk>:
> >> On Mon, 20 Mar 2000 20:50:08 -0600, you wrote:
> >> >On Mon, 20 Mar 2000, Horst von Brand wrote:
> >> >>Jesse Pollard <pollard@tomcat.admin.navo.hpc.mil> said:
> >> >>
> >> >>[...]
> >> >>
> >> >>> Without overcommit:
> >> >>> You can support 100 simultaneous connections, with full saturation of
> >> >>> each server, with no failures.
> >> >>>
> >> >>> Result: happy customers, happy management, maybe even a raise.
> >> >>
> >> >>Management nagging about supporting more pages, worried by waste of several
> >> >>Gb of disk that has never, ever been touched. System is sluggish, needs
> >> >>constant tweaking of "resource allocation quotas" as applications crash,
> >> >>even there are resources available. Seriously consider firing inept
> >> >>sysadmin.
> >> >
> >> >Disk space is cheap.
> >>
> >> Still not cheap enough to throw away, unless you are Boeing.
> >>
> >> >System is no more sluggish than currently, and is much more reliable.
> >>
> >> It is slower, because of the extra overhead you have imposed on every
> >> memory allocation. And it is no more reliable, except in OOM
> >> situations - which your draconian resource quotas make MUCH more
> >> frequent for the users.
> >>
> >> >First, "constant tweaking" means the system was/is undersized, and the
> >> >accounting records justify the additional business support for expansion.
> >>
> >> Point to a system which has never passed 50% of VM in use, and your
> >> request will be laughed out of purchasing.
> >
> >Cray YMP-8 - reached 100% several years ago. Replaced by C90, C90 replaced
> >by dual SV1 system. SV1 system at 80%, expanding to 4 nodes.
> >SGI Power Challenge array - reached 90% usage in 1 year. Replaced by Origin
> >2000. Origin 2000 (128 node) updated when load reached 80% (with several
> >OOM crashes due to management's choice to overcommit)
>
> What do you mean by "overcommit" in this case - the total online user
> quota exceeding total VM, or the Linux (memory allocated on demand)
> sense?

Here overcommit is the total concurrent user quota exceeding total VM. When
the users actually use the memory allocated (memory allocated on demand), the
systems can fail/hang/crash/randomly abort processes.

> And are you seriously telling me that if I take an Irix box, disable
> user VM quotas and do a malloc() loop, the system crashes?! (Mental
> note: Replace all Irix boxes with NT)

Note: malloc it AND use it, then it can. (NT can't do the work - too slow,
insufficient number of CPUs, lack of multi-user identification, insufficient
disk - although that is easing.)
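
To make that distinction concrete, here is a throwaway sketch (mine, not a
test from any real system). Under an overcommitting kernel the malloc()
calls can keep succeeding long after real memory is spoken for; the damage
only shows up when the pages are actually written. Under strict accounting,
malloc() itself returns NULL first:

/* overcommit-demo.c - sketch of the malloc-vs-use distinction. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (1024 * 1024)             /* grab memory 1MB at a time */

int main(void)
{
        unsigned long mb = 0;
        char *p;

        for (;;) {
                p = malloc(CHUNK);
                if (p == NULL) {        /* strict accounting stops here */
                        printf("malloc failed after %luMB\n", mb);
                        return 0;
                }
                memset(p, 1, CHUNK);    /* overcommit fails (or the OOM
                                           handling kills someone) here */
                mb++;
        }
}

On a box that overcommits, the memset() side is where inetd or sendmail can
get shot instead of this process.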

> > Added another 128 nodes,
> >and a TB of disk (30-40GB used for swap.. still overcommitted, but not as
> >severely). I have more.
> >
> >>
> >> >Second, If it is determined that overcommitment is necessary, then it can
> >> >still be done - along with the documentation on what happens when the system
> >> >crashes.
> >>
> >> Overcommit does not cause system crashes; nor is it related in any way
> >> to user resource quotas.
> >
> >Of course it is. If too many users are on one system trying to do too much
> >at one time the system fails.
>
> "The system fails"?? Either you have one broken system, or you have a
> strange definition of "failure". The users should just get errors that
> they don't have the resources available.

The users sometimes get process aborts. The problem on a 128 CPU system is
that it can fill faster than it can abort. The IRIX systems, for example,
do not have resource control as strong as UNICOS's.

> > Resource quotas force the users to schedule
> >their work, and to cooperate with the system to accomplish the goals of
> >business management.
> >
> >> >I've never been fired for guaranteeing system uptime.
> >>
> >> You should be, if you do so by disabling legitimate use of the system.
> >
> >I didn't - legitimate use was guaranteed. Crashing the system wasn't one
> >of the legitimate uses. Nor was causing the failure of other users' processes.
>
> Crashing the system should not be possible under any circumstances,
> regardless of resource quotas. If a user-mode program can crash the
> system using only unprivileged syscalls, you have found a nasty bug.

Not really - even root uses unprivileged syscalls. When the system resources
are exhausted, the system fails. Sometimes gracefully, sometimes not.

> >> >Obviously you don't work where uptime is counted in the thousands of dollars
> >> >per hour.
> >>
> >> If users are unable to perform legitimate tasks due to a "lack of
> >> resources", when the machine has ample resources to spare, I would
> >> consider the system down. I would then fire the sysadmin who
> >> configured it that inefficiently.
> >
> >That's just it - the system was saturated.
>
> Your quota settings would cause the system to begin denying resource
> allocation BEFORE the system is saturated.

Yes. The goal is to do that within some percentage of the total. On a 1GB
memory system I would like the user resource limits to keep the last 10-20
pages free. On the target systems I'm looking at (where it would be most
useful) there is > 1GB memory with more than 1 CPU (2 - 32) AND more than
10 users. This puts the burden of maximizing system usage on the schedulers,
both interactive and batch. The batch scheduler has a much easier time,
since it can pick and choose jobs to run from start to finish. The
interactive scheduler can only take first come, first served. To compare a
batch scheduler to an interactive scheduler, it may be more proper to
compare a batch scheduler with a combined inetd/telnetd/login group. I'm
not pointing at the kernel process scheduler, whose job is to maximize CPU
utilization.
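
The per-process half of such limits is already expressible through the
standard setrlimit() interface; roughly this (the 128MB cap is just a
number I picked for illustration):

/* Sketch only: cap a process's address space so its allocations
 * fail with ENOMEM well before the machine itself is saturated.
 */
#include <stdio.h>
#include <sys/resource.h>

#ifndef RLIMIT_AS
#define RLIMIT_AS RLIMIT_DATA   /* no total-VM limit? cap the data seg */
#endif

int main(void)
{
        struct rlimit rl;

        rl.rlim_cur = 128UL * 1024 * 1024;  /* soft limit, enforced now */
        rl.rlim_max = rl.rlim_cur;          /* hard ceiling             */

        if (setrlimit(RLIMIT_AS, &rl)) {
                perror("setrlimit");
                return 1;
        }
        /* ... exec the user's shell or batch job here ... */
        return 0;
}

What no per-process limit can do is the system-wide part - keeping the SUM
over all users under the total - which is exactly the scheduler's job above.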

There are process schedulers that schedule based on memory utilization too.
Linux doesn't have that either. I'm also not fully convinced it needs one
yet, although such schedulers can help utilize memory better (IRIX seems to
do this, but I don't think it does a good job).

> >> However, if you REALLY insist on setting the system up so people can
> >> login and display all sorts of nice OOM error messages on an idle
> >> system, go ahead. I'm not stopping you.
> >
> >If it were idle there would be no OOM error messages :)
>
> Yes there would - you run out of YOUR memory, even when the system has
> plenty to spare.

An IDLE system has free CPU time as well as free memory available. Now, if I
am the only user on the system, your statement is correct. The system being
targeted will tend not to be idle - there may be available CPU and memory -
but I'm not the only user. I would expect to be one out of 5-20 users, where
most of the others may be batch jobs.

> >> >>> If you overcommit:
> >> >>> You might support 100 simultaneous connections, but not full saturation
> >> >>> of each server, without aborting some connections or crashing the
> >> >>> server/system.
> >> >>>
> >> >>> Result: some unhappy customers, not so happy management, difficulty
> >> >>> in identifying problems, and definitely no raise.
> >> >>
> >> >>Other customers are unhappy because of long download times, NS (or
> >> >>whatever) crashing, not getting to the site because of (local or remote)
> >> >>network congestion. The "catastrophe" isn't even a blip on the sysadmin's
> >> >>screen.
> >> >
> >> >Yup - long download times since the system keeps crashing, forcing the
> >> >customer to a different site and or different company.
> >>
> >> No, the system never gets as far as crashing, because the inept
> >> sysadmin has already disabled it.
> >
> >not worth answering.
>
> Your approach, with a typical workload, would waste a large percentage
> of the system's resources. Users rarely max out their quotas - but you
> set the system to ensure that a user with a 128Mb quota has 128Mb
> exclusively reserved for their use alone anyway?

Where I work, users frequently max out their quotas, especially when they
are interactive. The majority of the jobs are not interactive (lots of VERY
small jobs copying data, some edits, and batch monitoring). Simulations take
a long time to set up; they take a lot of memory/CPU, but have small
interactive needs.

I exceed my quota almost every time I attempt to view syslog with vi, unless
it has just been cycled. I try anyway, since it is very convenient when it
works.

> >> >The catastrophe is loss of business, loss of reputation, going out of
> >> >business because you can't keep the customers.
> >>
> >> The customers have already gone somewhere they can actually run
> >> software without OOM errors.
> >
> >not your site - it keeps crashing.
>
> Why? If I have the same level of resources as you, so the same
> workload will succeed without any problems.

No it won't - we force our users to schedule access to run large memory
jobs. You don't. If two of our users ran their jobs simultaneously, most
likely both would abort. On our Origin, we have a total memory of 64GB.
We allow our users to run a process of up to 32GB. Our batch system
limits the total number of (big) processes to 1. Our interactive users
are limited to 15 minutes of CPU time. When it becomes necessary,
we will limit the total number of interactive users (it's already limited,
but much higher than I like -- if they all logged in at once, the
system could crash - management's choice, definitely not mine). Due to the
way the system IS overcommitted (memory and interactive), I have seen
process failures (one crash; inetd, cron, sendmail ... all have failed
at various times). All of these failures have been attributed to
oversubscription of resources.

Special arrangements can be made to run a process of up to 60GB (we shut
down all batch processing and run a special queue with only one job at a
time). The user can submit his job at any time; it just won't run until the
resources are available.

> You still don't appear to understand what overcommit means in this
> context.

"overcommit": The sum of all concurrent process virtual memory ranges (in
              pages) exceeds the number of pages in main memory plus swap
              memory.

virtual memory range: That addressable memory requested by a process and
              allocated to the process by the kernel via sbrk, fork, exec,
              and stack. (mmap should be included too, but I don't know
              for sure yet.)

There is still some memory not accounted for here - memory buffers for
I/O, kernel addressing requirements, TCP, driver pools, etc.

The primary usage of memory is covered.
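
Reduced to code, the test a strictly accounting kernel would apply at grant
time (sbrk/fork/exec/stack growth) is just the comparison below. Names are
mine and purely illustrative; this is not code from any kernel:

#include <stddef.h>

/* Would granting 'request' more pages overcommit the machine? */
int would_overcommit(size_t committed,   /* sum of all VM ranges, pages */
                     size_t request,     /* new sbrk/fork/exec/stack    */
                     size_t ram_pages,   /* main memory                 */
                     size_t swap_pages)  /* swap                        */
{
        /* Strict accounting refuses at grant time, never at fault time,
         * so nothing has to be killed later to make the sums work out. */
        return committed + request > ram_pages + swap_pages;
}

The memory not accounted for above (I/O buffers, kernel, TCP, driver pools)
would have to come off the ram_pages + swap_pages side of the comparison.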
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil

Any opinions expressed are solely my own.
