Avoiding *mandatory* overcommit...

From: Linda Walsh (law@sgi.com)
Date: Thu Mar 30 2000 - 16:51:22 EST


Horst von Brand wrote:
> Not in itself, the problem is that if you don't ever want to overcommit
> anything you must know exactly how much memory each activity could use, in
> the very worst case.

---
	???  But this is known.  If a process mallocs 1 Meg of memory, you
commit (reserve) one meg of physical memory/swap.  If a process does an exec,
you 'commit' (reserve) exactly all of the COW pages.  In the case of stack,
the user or sysadmin sets a minimum stack to run with -- let's say 128K or
1M -- whichever.  That is the amount reserved.  Then when I write a reliable
process that limits its stack usage to less than the reserved value, every
call will return a reliable indicator -- "pass or fail".  Never will I just
be 'killed' because the system couldn't map memory it had already allocated
to me.
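
For illustration, a minimal sketch of the "reliable" style being argued for,
assuming overcommit is disabled so malloc() and fork() report failure where
the program can act on it.  The recovery actions are placeholders; the point
is only that every allocation path has a checkable result:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char *buf = malloc(1024 * 1024);    /* ask for 1 Meg: reserved, or NULL */
    if (buf == NULL) {
        fprintf(stderr, "malloc: %s\n", strerror(errno));
        return 1;                       /* recover or back off -- not killed */
    }
    memset(buf, 0, 1024 * 1024);        /* safe: pages are already committed */

    pid_t pid = fork();                 /* COW pages reserved at fork time */
    if (pid == -1) {
        fprintf(stderr, "fork: %s\n", strerror(errno));
        free(buf);
        return 1;
    }
    if (pid == 0)
        _exit(0);                       /* child work would go here */
    waitpid(pid, NULL, 0);

    free(buf);
    return 0;
}
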
"Peter T. Breuer" wrote:
> 
> "A month of sundays ago Linda Walsh wrote:"
> >       Well -- that's sorta the point -- Everything from 'atd' to 'vi'
> > would need to be rewritten to 'touch' pages of alloc'ed memory.  If you want
> 
> Where do you get this from? Just change the malloc they use in libc.
> This used to be commonly done to protect netscape from itself
> (LD_PRELOAD = gnumalloc.o). This is and always has been trivial.
---
	That solves nothing.  When you go to touch the memory, if you are
at an OOM condition, your process faults because memory can't be mapped.  You
haven't returned an error that my program can deal with -- you just killed
it.  It also doesn't solve COW pages after a fork. 
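
For reference, a minimal sketch of the kind of LD_PRELOAD malloc wrapper
being suggested (assumed behavior, not the actual gnumalloc code): allocate,
then write one byte per page so any overcommitted pages are faulted in
immediately.  As argued above, this only moves the failure -- under OOM the
touch loop takes a fatal fault instead of malloc() returning NULL.  (A real
interposer also has to cope with dlsym() allocating memory itself; that is
omitted here.)

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

void *malloc(size_t size)
{
    static void *(*real_malloc)(size_t);

    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

    char *p = real_malloc(size);
    if (p)
        for (size_t i = 0; i < size; i += 4096)  /* roughly one write per page */
            p[i] = 0;
    return p;
}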

> Overcommitment is fine. If you don't want it, reserve your own backing
> swap space for your stack. That's all. Then you can't be murdered by
> the kernel because the kernel will always be able to page you out.
---
	???  Without backing store reserved on stack, the kernel can always
page you out.  The problem is when trying to "touch your malloc space" or
touch all of your COW pages after a fork (?!?) -- you don't get a nice
reliable 'out of memory', or NULL pointer back from malloc, you just get
killed because the kernel couldn't map memory for you at run time.  Swapping
has nothing to do with anything.

Matija Nalis wrote:
> Yes. It is easy to force (just overload malloc(3) with small library wrapper
> in /etc/ld.so.preload). Or wrap around brk/sbrk(2).
---
	Doesn't solve overcommit of COW pages from fork.  The wrapper would
only guarantee that you die immediately after a malloc rather than have it
return a NULL pointer because the request couldn't be fulfilled.

> Problem is, overcommit is just one thing that could lead to OOM.
> We can discuss that you can lower the chances for OOM by disabling
> overcommit, implementing efficient per-user-VM quota, fixing stack size,
> reserving some memory only for kernel and/or root, etc, but it will just
> lower the chances for OOM, not eliminate it.
---
	I'm not trying to lower chances or eliminate OOM.  OOM is an
independent subject.  This is a discussion about how to control what happens
on OOM.  If you have allowed overcommit, the results are less predictable for
any app in the system.  If it is _possible_to_disable_ overcommit, the mode
of failure becomes easier to predict.  Any given app can choose to deal with
a NULL pointer from malloc, and any given app can choose to deal with a
failure of a fork, and in both cases react appropriately.  Any given app
could reserve (perhaps at 'ld' time) sufficient stack space for itself if it
knows its own behavior.  Then I can write well-behaved apps that deal with
insufficient resources in a sane manner -- not just randomly have my app or
another 'die'.

> OOM will still happen, and we need some handling when it hits us. killing
> random process is at least guaranteed to be fair as far as it goes, but
> fairness is not necessarily the most sane thing to do in all cases...
---
	Fair?  What would be fair is to allow processes that are written to
handle low resources without dying to continue to run.  Those programs that
are written not to handle those cases would be the first to 'die'.  Survival
of the fittest? :-)  Look at it as Darwinistic program evolution.
Better-built programs survive.  Poorly written programs that don't check
error codes die.

To argue that such a paradigm would require recoding of many existing apps is
simply arguing to support bad programming.

Horst von Brand wrote:
> Nope, I talking about the _kernel's_ memory usage here.
>
> > The only thing a program has to "predict" is a maximum stack
> > size -- which is physically reserved as a *minimum* at run time. All
> > other requests for memory can be denied with an error code.
>
> And crash if you run out of stack? That was supposed to be forbidden...
---
	If a program is *well designed* (possibly a foreign concept, I know),
it should be able to predict that at most it will use X bytes of stack.  If
that is reserved at runtime, there will be no crash from running out of stack
(see the sketch below).  Crashing by running out of stack isn't forbidden for
those programs that have no design that specifies maximum stack usage.

> If you want to go that way, let the kernel do the dirty work. That is
> probably easier than fixing several thousand programs.
---
	Fixing several thousand programs?  We are only talking about programs
that don't check the error codes they get back from system calls.  Are there
that many poorly written programs that expect infinite resources?  As for
stack -- that can be defaulted to a sufficiently large space to allow for
most programs.  The only programs that would die would be those that use more
than the pre-estimated amount of stack, and ONLY in an OOM situation.
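
As a sketch of what "reserving" stack at runtime could look like from the
application side (illustrative only -- the 128K figure and the 4K page size
are assumptions): touch the whole intended reserve once at startup, so any
failure happens predictably before the real work starts, and under a
no-overcommit policy the pages are committed from then on.

#include <stddef.h>

#define STACK_RESERVE (128 * 1024)      /* assumed per-program maximum */

void prefault_stack(void)
{
    volatile char probe[STACK_RESERVE];

    for (size_t i = 0; i < sizeof probe; i += 4096)
        probe[i] = 0;                   /* one write per page maps/commits it */
}

Called once, early in main(), before any deep call chains.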

> "(allowing for failures to occur predictably)" for whom? Not for me, as the > final user, I just see my programs crash at random or not starting at all. --- ??? crash at random? If the program doesn't check error codes, yell at the author. If it doesn't start because you are out of memory, that doesn't make sense to you? I suppose being out of processes doesn't make sense to you either.

> If they wander into some OOM-killer that is halfways decently done, they
> will be killed _less_ (overcommitment will let them go further, perhaps
> even go through) and _more_ predictably, i.e., the ones killed will be
> probably those that are memory hogs.
---
	X and netscape are the biggest memory "hogs" on my system.  I sure
hope "X" doesn't randomly get picked.  But if 'X' is well designed to handle
mallocs that fail and has sufficient stack reserved, it will be *ROBUST*.
Robust programs are *good* programs.  Robust programs should never be
targeted by a random killer.  That rewards those who practice bad programming
techniques while creating a disincentive for writing robust ones: "Well, I'll
just write my program to expect resources to always be there, cuz it'll be
random which process the OOM-killer hits anyway -- and maybe I'll usually be
running as root and will be started up at system boot time, so I'm protected
from the OOM-killer's selection algorithm."

> How is the state after killing a memory hog "completely unstrusted
> (non-predictable)"?
---
	Biggest memory hog on my system is 'X'.  I'd say killing off my
interface generally results in a random number of programs dying and a random
amount of work being lost.  But if you are _allowed_to_disable_ overcommit,
and if you write 'X' to be *robust*, it won't.

Grendel wrote:
> Somebody earlier
> mentioned that in such situation IRIX goes to a deadlock - is it a sensible
> behavior?
---
	The C2 specification says that when you can no longer record audit
events, the system should prevent auditable events from occurring.  Depending
on what is being audited, halting the system or deadlocking would effectively
meet that requirement.

> Now that I think about it, the compartmented mode on our average PC is quite
> improbable :)), but with a cluster of PCs it can be quite possible to do and
> very flexible. But what if the Linux kernel supported full virtualization of
> all of its sharable resources? That is, it would create full VMs with
> virtual network cards, CPUs, pure virtual memory, block devices?
---
	That could be one way.  To create isolation, much of the available
system info (paging activity, disk usage, etc.) would have to be restricted,
but for B1 security, elimination of covert channels isn't required.  But
imagine a B1 system where all programs are checked for operational
correctness and handling of error conditions (like insufficient resources).
Then if you allow non-overcommit, you could say you have a robust system.
But if all your programs are robust in checking error codes and reserving
their necessary stack, and the kernel *lies* (low integrity) about the
resources it has allocated, all of your attempts to design a robust program
are for naught. :-(

> Heh :)))), yeah - that's a bit of a comedy :)). But, what if the Mr. Prez's
> process isn't killed but simply put on hold and Mr. Prez sees a nice cute
> box popping up on his screen saying "We're sorry - a temporary shortage of
> memory caused your process to stop. Please wait patiently."
---
	Is the cute box an icon of Monica Lewinski? :-)

> The process is
> put on the wait queue until the memory becomes available. To prevent
> starvation, the system should be preemptive, of course. The processes in
> separate priority groups would be serviced in parallel mode until memory
> comes short, then the higher priorities win of course. The processes in the
> same priority group would have to be serialized for the access to memory if
> the need arises. How about that scenario? No killing, some patience
> sometimes :))
---
	Absolutely.  But you can't put the icon display into the kernel.
That's an application-level feature -- since any given app could choose to
spill to disk, like an editor: if it can't store all of its files in memory,
it creates its own spill files.  Such behavior was common in the days when
memory was limited to <640K.

Marco Colombo wrote:
> If you use plain malloc(), you're not allowed to think you have any
> space guaranteed. It's bad programming.
---
	?!  mlock locks pages in memory.  I just want to malloc (from the man
page):

	malloc() allocates size bytes and returns a pointer to the allocated
	memory.  The memory is not cleared.
	...
	For calloc() and malloc(), the value returned is a pointer to the
	allocated memory, which is suitably aligned for any kind of variable,
	or NULL if the request fails.

It's not bad programming to expect that malloc will allocate memory. It's the documented interface. It is the documented interface to return NULL if it cannot allocate the memory. With overcommit, the kernel has broken this model because the memory isn't really allocated -- just the process's top of heap pointer has been moved. My contention is that this is not ANSI-C compliant.

> If you need guaranteed "space"
> (memory) use another kernel interface, such as mlock(). I'm not saying
> the current interface is perfect. I'm just saying that overcommitting
> is not the problem. You don't need to turn overcommiting off. You
> need you use a better interface than malloc() to get "safe" memory.
---
	Not if we claim to be ANSI compliant.

> For stack grow, maybe we need some way to tell the kernel:
> "never page-out my stack, and reserve me this space...".
---
	Paging out is not the issue.  The issue is not having enough combined
memory and swap space.  OOM doesn't simply mean out of physical memory -- it
means out of swap space as well.  For this discussion most people are using
"memory" to mean "memory+swap".

> Applications should be able to bypass kernel management of their address
> space. But this should be done on a per-app base.
---
	I agree with this statement, but it isn't relevant to the discussion
topic.

Richard Gooch wrote:
> Yeah, right now the kernel "automatically" reserves 1 page for the
> user stack anyway (simply because the process main() function will
> require it).
>
> Reserving some minimum number of pages has the problem of determining,
> ahead of time, how many should be reserved. Are you suggesting this be
> done on a case-by-case basis?
---
	You can have three values: one at link time, one default for the
system set by the system administrator (system policy), and one set by the
user as the default for their processes.  Take the 'max' of those values (a
sketch follows below).  In the case of none of those values being set, you
get the current 1-page allocation behavior -- which isn't a problem unless
you run O-O-M.

	This means users/admins/programmers have choices to provide different
minimums than the default 1 page.  None of them *has* to be provided, with
the caveat that under OOM conditions some SIGNAL (SIGNMEM?) will be sent to
your process.  If it is ignored, the process is suspended until the mapping
can be done.  If the signal is left at its default, the process that ran out
of memory -- the one that didn't correctly specify its maximum stack usage,
run by a user who didn't specify large enough minimums for their processes,
on a system where the sysadmin didn't set a high enough systemwide minimum --
will be killed.
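
A sketch of the selection rule described above (the names and the
zero-means-unset convention are illustrative, not an existing interface):

#define PAGE_SIZE 4096UL                /* assumed page size */

unsigned long stack_reserve(unsigned long link_time_min,
                            unsigned long system_min,
                            unsigned long user_min)
{
    unsigned long r = link_time_min;

    if (system_min > r)
        r = system_min;
    if (user_min > r)
        r = user_min;

    return r ? r : PAGE_SIZE;           /* nothing set: today's 1-page behavior */
}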

> Still seems a shaky foundation. To make this work, you would at least
> need a signal that is sent when you reach some high watermark of used
> stack pages, but still have enough reserved so that the application
> can still do something about it. SIGSTACKDANGER or some such
---
	Perhaps some general 'ulimit' controls for user memory usage (soft and
hard) would be desirable -- but that's a step beyond the basics of reserved
memory accounting and enforcement.

> It would be even better if this mechanism wasn't restricted to stack
> pages. Rather, just make it global, for all the process pages. Simply
> reserve a specified number of pages, and send a signal when a high
> watermark is reached.
---
	That's a related, but separate, issue.  Right now the immediate
problem is returning "success" indicators for malloc and fork when in reality
there may be no more space in memory/swap.  The process can't determine this
-- only the kernel knows if it is 'overcommitting' and if it is out of space.
I'm simply wanting the _option_ to have the kernel do the bookkeeping and
"tell the truth".  My contention is that the kernel should always "tell the
truth" -- but a sysadmin can lie to the kernel and allocate 1T of virtual
swap.  This way the kernel can always act properly in how it handles its
resources, and any users that want the current behavior just add a vswap line
to their /etc/fstab (just like 'shm' users had to add a line in 2.3).
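
To make the "virtual swap" idea concrete, a hypothetical /etc/fstab entry
might look something like the line below; the 'vswap' type and its size
option are invented for illustration and do not exist in Linux:

	none    /dev/vswap    vswap    size=1t    0 0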

-linda

-- 
Linda A Walsh                    | Trust Technology, Core Linux, SGI
law@sgi.com                      | Voice: (650) 933-5338
