Re: Avoiding OOM on overcommit...?

From: Jesse Pollard (pollard@tomcat.admin.navo.hpc.mil)
Date: Fri Mar 24 2000 - 16:21:40 EST


 Sandy Harris <sandy@storm.ca>:
> Alan Cox wrote:
> >
> > > The explanation is not the issue. The issue is having programs that
> > > act correctly (check malloc() returns, don't follow null pointers,
> > > etc.) fail in ways that are quite difficult for their programmers
> > > to prevent.
> >
> > Indeed. And it's completely valid to kill them off. Right now nobody
> > has a serious commercial requirement for non-overcommit on Linux.
>
> I'm seriously confused by this thread.
>
> If I understand correctly, programs can malloc() memory, get back a
> valid pointer, then segfault later when they try to use the memory.
> This strikes me as quite obviously Wrong.
>
> I can see no question about whether this is a problem. To me, it clearly
> is, and a ghastly one at that. This breaks the basic behaviour of
> important system calls, making memory allocation unreliable.
>
> It is not clear to me why anyone would implement overcommit in the
> first place or, once the obvious problem is pointed out, why there
> should be discussion of anything other than how to fix the blunder.
>
> On the other hand, I see several people, some of whom I know are
> competent, defending the existing behaviour.
>
> Methinks not all those folk are clueless. More likely, I've missed
> something basic here. Would someone please explain it?
>
> Would it make everyone happy to have tunable parameters for this?
> Using R for RAM, S for swap, max Commitable is:
>
> C = aR + bS
>
> and sys admin sets a, b. Default is perhaps a = 1, b = .5 to keep
> paging to reasonable levels, but you might have a = b = 10 on one
> system to allow heavy overcommit and a = 1, b = 0 on another for
> minimum swapping.

This is a bit oversimplified - let's expand it somewhat:

   let p = number of concurrent processes permitted
       r = minimum guaranteed resident (in R) pages (resident set size quota)
       v = virtual pages that a user is permitted to use (virtual quota)

Each user gets some value of v. On some systems r is a global limit
that applies to all users.

Let r be a global limit assigned to all processes - then:

        p = R/r (each process is guaranteed this amount)

p can now be considered a process quota for the entire system. When p is
        exceeded you get the "can't fork - no process slots" message.

Note: There are other formulas for this. p could be arbitrarily set, and
        r computed from R/p. This has the problem that a process may be
        guaranteed only one page, which makes for an unusable system.
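
To make the arithmetic concrete, here is a toy C fragment (the numbers and
names are mine - no real kernel computes it exactly this way):

    #include <stdio.h>

    #define PAGE_KB 4                  /* assumed page size in KB */

    int main(void)
    {
        unsigned long R = (256 * 1024) / PAGE_KB;  /* 256 MB RAM, in pages */
        unsigned long r = 512;         /* guaranteed resident pages */

        unsigned long p = R / r;       /* process slots: p = R/r */
        printf("R = %lu pages, r = %lu -> p = %lu processes\n", R, r, p);

        /* The alternative: fix p and derive r = R/p. A large p can leave
         * r near one page, which is unusable in practice. */
        unsigned long p_fixed = 60000;
        printf("p = %lu -> r = %lu page(s) per process\n",
               p_fixed, R / p_fixed);
        return 0;
    }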

        A second note - since r is the guaranteed resident size, a process
        may actually have more. If the system needs resident pages for a
        different (new) process, then all processes exceeding this amount
        could be trimmed (paged out) to get resident pages for the new
        process. This does require periodic processing (daemon level, not
        kernel) to balance the resident pages fairly among all processes.
        This could be done in several ways - improve CPU utilization,
        attempt to minimize the page fault rate... different policies,
        different goals.

Let k be the number of pages that root will use for daemons and kernel
        operation

Let x be the current amount of memory given to users. This is computed
by:
        x = sum (v of all users logged in) + k

note: x <= R+S, or there is an overcommit. I'll ignore mmap for this
        discussion because it appears to affect r and S (the user's data
        file could be considered an extension of swap, and the page mapping
        affects the number of resident pages).
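
A toy version of that commitment check, in C (illustrative numbers, all in
pages):

    #include <stdio.h>

    int main(void)
    {
        unsigned long R = 65536;    /* physical pages */
        unsigned long S = 131072;   /* swap pages */
        unsigned long k = 8192;     /* pages for root daemons and kernel */
        unsigned long v[] = { 40000, 60000, 90000 };  /* per-user quotas */
        unsigned long x = k;        /* x = sum(v of users logged in) + k */
        int i;

        for (i = 0; i < 3; i++)
            x += v[i];

        if (x <= R + S)
            printf("x = %lu <= R+S = %lu: no overcommit\n", x, R + S);
        else
            printf("x = %lu > R+S = %lu: overcommitted by %lu pages\n",
                   x, R + S, x - (R + S));
        return 0;
    }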

How do these new parameters fit in?
        r is a performance tool that reduces the page fault rate, and hence
           the load on I/O. It also improves CPU utilization for a process
           and reduces general overhead.
        v is a user limit that must be shared among all processes belonging
           to the user. This raises several problems, which is what the
           occasionally loud discussions are about.

These additional parameters expand the formula to a closer-to-reality
form used in large systems. None of this goes into how to implement them.
Most systems have weight factors that are added in based on the user,
the resource being dealt with (resident vs. swap may get different weights),
CPU accounting, etc.

The more information available to the kernel, the easier it is to handle
user processes. Some systems' linker (ld) puts (or allows to be put) a
request for an initial stack size into the executable. This allows exec to
specifically allocate that amount of memory. If the user's stack exceeds
this, the user's program is aborted (not fun when it is the shell, but it
happens).
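
For comparison, the closest everyday analogue on Unix is RLIMIT_STACK - a
per-process ceiling enforced when the stack grows, rather than a
preallocation requested by the linker. A sketch:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_STACK, &rl) == 0)
            printf("stack soft limit: %lu bytes\n",
                   (unsigned long)rl.rlim_cur);

        /* Lower the soft limit to 1 MB; growing past it faults the
         * process (typically SIGSEGV), much like the abort above. */
        rl.rlim_cur = 1024 * 1024;
        if (setrlimit(RLIMIT_STACK, &rl) != 0)
            perror("setrlimit");
        return 0;
    }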

Memory is allocated to a process in sbrk, fork, exec, and on a stack page fault.

Most of the discussion is about fork, with references to stack page fault.

A fork makes a copy of a process's page table for the new process. At the
same time, each data page is marked for copy-on-write (COW). Copy-on-write
is a performance optimization that eliminates the overhead of having to
copy each page (even those stored in swap) of a potentially large program.
The semantics of fork still call for the copy, hence copy-on-write: a read
of a page is equivalent no matter which process does it, so a page does not
need to be copied on a read. If a page is modified, then a new physical
resident page must be allocated and entered into the page table of
whichever process (parent or child) did the write.
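
The semantics COW must preserve are easy to demonstrate. In this sketch the
child's write is exactly what would force a page to be duplicated; COW
itself is invisible to the program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int value = 1;
        pid_t pid = fork();

        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid == 0) {         /* child: this write triggers the copy */
            value = 2;
            printf("child  sees value = %d\n", value);
            _exit(0);
        }
        wait(NULL);             /* parent: its page was never touched */
        printf("parent sees value = %d\n", value);   /* still 1 */
        return 0;
    }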

NOTE: the fork provides the appearance of a duplicate of the entire process
AND the assumption that any or all pages in the new process may be written
to (minus the executable text). If the child or parent process returns from
the routine that executed the fork and proceeds to call a different function,
the stack is modified. Since it is part of the COW pages, it must be
duplicated. This calls for another change in resident set for one or the
other process (whichever did it first). In all cases the old page is no
longer marked COW, and the new page is a copy of the old one.

IF the fork subtracts the total number of pages (the size of the parent)
from the current value of v (the user's quota), then the resulting value of
v may be < 0. This would abort the user's process.

This is the simple method of handling user memory quotas, and the most
painful.
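
In code, the strict accounting looks something like this (hypothetical
names; the point is only that the full parent size is charged up front):

    struct user_quota {
        long v;                     /* remaining virtual pages */
    };

    /* Charge the child's full size against the user's quota at fork
     * time; returning 0 maps to a failed fork (EAGAIN/ENOMEM). */
    int charge_fork(struct user_quota *u, long parent_pages)
    {
        if (u->v - parent_pages < 0)
            return 0;               /* v would go below zero: refuse */
        u->v -= parent_pages;
        return 1;
    }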

The objections are:
  1. the system may not be out of resources
  2. the user's process could execute exec, which completely replaces the
     copied memory with a new executable image, making the entire allocation
     for the duplicate unnecessary.
  3. The new process could write data out and exit - accessing some memory
     but modifying none of it. This could still cause the parent to duplicate
     pages, since the parent may be updating them (it doesn't stop just
     because it did a fork).

This was the origin of the "vfork" branch. The BSD vfork did a fork, but
put the parent process to sleep until the child process terminated or did
an exec. This made COW unnecessary, and made the child much closer to a
thread - except that a thread has a separate stack and is still tied into
its parent process.
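
The usual vfork pattern, for reference: the parent blocks until the child
execs or exits, so nothing needs to be marked COW. The child must touch
nothing but exec and _exit while it borrows the parent's address space:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid = vfork();

        if (pid < 0) {
            perror("vfork");
            exit(1);
        }
        if (pid == 0) {
            execlp("echo", "echo", "hello from the child", (char *)NULL);
            _exit(127);         /* only _exit is safe if exec fails */
        }
        waitpid(pid, NULL, 0);  /* parent resumes once the child execs */
        return 0;
    }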

I consider objections 2 and 3 to be valid, and want some way to alleviate
them. I think #1 is irrelevant, because management has already informed the
user (via quota v) that More Will Not Be Given. (It takes an act of
management to change that decision.)

The one I suggested:

        let the child fork, but with a limit of 5-10 pages. IF the child:
        - terminates, then its resources are released;
        - does an exec, then new memory limits are generated;
        - uses up the minimal resource, then full resources must be
          allocated, and the potential for v < 0 occurs again.
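
The bookkeeping I have in mind would look roughly like this (all names
invented, counts in pages):

    #define FORK_GRANT 8            /* token pages granted at fork */

    struct child_acct {
        long used;                  /* pages dirtied so far */
        int promoted;               /* full quota charged yet? */
    };

    /* Called on each COW fault in the child; user_v is the user's
     * remaining virtual quota, full_size the parent's size in pages.
     * Returns 0 if the child must be aborted. */
    int child_cow_fault(struct child_acct *c, long *user_v, long full_size)
    {
        if (++c->used <= FORK_GRANT)
            return 1;               /* still inside the token grant */
        if (c->promoted)
            return 1;               /* already paying full freight */
        if (*user_v - full_size < 0)
            return 0;               /* v would go negative: abort child */
        *user_v -= full_size;       /* charge the full duplicate now */
        c->promoted = 1;
        return 1;
    }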

This doesn't really work, since the parent process could start modifying
data pages immediately. The COW pages could be duplicated very quickly,
exhausting the 5-10 page limit. If the pages weren't marked COW, then the
parent would be able to modify pages out from under the child (and vice
versa). This is more like a thread and is not appropriate for most uses of
fork.

One system I ran across (I think it was HP, but perhaps not) tried to solve
this by putting the new process at a significantly higher priority than the
parent process. This had several advantages:

1. The new process started processing very quickly - its time slice would
   give it enough time to exec a new image quickly (at which point the
   priority dropped back down).
2. The delay imposed on the parent process by the priority difference
   prevented it from modifying data out from under the child.

This works on a uniprocessor system, but less well on a multi-CPU system.
The second CPU might take the parent process and start executing it. Now we
are back to the original case.

This approach might still work IF:
 a. the scheduler knew not to resume a parent process for one time slice
    after a fork (and what is one time slice here, anyway? the fork may
    happen right at the end of the parent's slice);
 b. the child still gets a 5-10 page minimal quota (for stack) and the rest
    of the conditions from the suggestion hold;
 c. the parent's delay is canceled if the child terminates or executes an
    exec.
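
A hypothetical sketch of (a) and (c): stamp the parent with a "hold until"
time at fork, clear it when the child exits or execs, and have the
scheduler skip the parent while the hold is active. None of this is real
Linux scheduler code:

    struct task {
        unsigned long hold_until;   /* in jiffies; 0 = no hold */
        struct task *parent;
    };

    void on_fork(struct task *parent, unsigned long now, unsigned long slice)
    {
        parent->hold_until = now + slice;     /* (a) sit out one slice */
    }

    void on_child_exec_or_exit(struct task *child)
    {
        if (child->parent)
            child->parent->hold_until = 0;    /* (c) cancel the delay */
    }

    int runnable(const struct task *t, unsigned long now)
    {
        return t->hold_until == 0 || now >= t->hold_until;
    }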

Problem here is that the parent could end up waiting for almost 2 time
slices.

Possible workarounds for the 2 timeslice delay would be:

a. do scheduling in units of half the current slice. This would allow the
   fork to set a half-slice delay, which in the worst case would be one
   full time slice.
b. don't know - there must be one.

None of the current solutions look really good. They CAN guarantee that the
system doesn't go OOM, simply because the draconian limits do not allow it.
Users would run out of resources while the system still had them.

The determination of which is better is, in the end, up to the owners
of the system.

I want the option.

BTW, I believe the "commercial requirement" is just having someone
offer to pay for it via a support and development contract. (I can't
do that, unfortunately.)
---------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil

Any opinions expressed are solely my own.
