Re: Out of Memory in v. 2.1

Colin Plumb (colin@nyx.net)
Thu, 8 Oct 1998 06:56:33 -0600 (MDT)


I was thinking about out of memory situations and had a few ideas
about solutions. I'm not sure that they're effective, so I'm asking
if anyone is aware of any research along similar lines to confirm or
refute my hunches.

To recap, the problem is that there are times that a Unix kernel is
called upon to promise large memory allocations when it is unlikely
that there will ever be a need to make good on the promise.

This is sort of like how a bank doesn't keep enough cash on hand to
satisfy every depositor closing an account at once; as long as it has
enough to satisfy everyone in practice, there are significant
advantages to the more "optimistic" practice.

Anyway, fork() is the primary culprit, since it is defined to give
a private writeable copy of a very large amount of memory (the entire
process address space), but this promise is almost never called upon.
99% of the time, the new child touches a page or two of stack and
then immediately exec()s, which deletes the copy. .9% of
the time, the child shares the data read-only with the parent.
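
To make the common case concrete, here's a trivial userspace sketch
of that fork-then-exec pattern (nothing here is kernel code); the
child dirties at most a page or two before exec() throws the whole
"copied" address space away:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pid_t pid = fork();     /* "copies" the entire address space */

        if (pid < 0)
                return 1;       /* fork() at least fails cleanly here */
        if (pid == 0) {
                /* child: a page or two of stack, then exec() drops the copy */
                execlp("ls", "ls", "-l", (char *)0);
                _exit(127);     /* only reached if the exec fails */
        }
        waitpid(pid, (int *)0, 0);
        return 0;
}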

Because of this, the copy-on-write implementation is popular. The
kernel marks the page as read-only and lets both processes access it.
If one process attempts to write to the page, the copy is made then.
Because this happens to so few pages, the fact that this "lazy"
technique is more expensive than "eager" copying of a single page does
not prevent it from being significantly faster overall.
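
In rough pseudo-C, the write fault on such a page does something like
the following (the struct and helper are made-up stand-ins, not the
real kernel code):

#include <stdlib.h>
#include <string.h>

struct page { int refcount; char data[4096]; };

/* Lazy copy at write-fault time: only now is the promise called in. */
struct page *handle_cow_write(struct page *shared)
{
        struct page *copy;

        if (shared->refcount == 1)
                return shared;          /* sole user left: just unprotect it */

        copy = malloc(sizeof *copy);    /* the promised copy is made HERE... */
        if (!copy)
                return NULL;            /* ...and there's no clean way to tell
                                           the faulting process "no" */
        memcpy(copy->data, shared->data, sizeof copy->data);
        copy->refcount = 1;
        shared->refcount--;
        return copy;                    /* caller maps this one writable */
}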

It also lets bloated monsters like netscape fork off children without
needing hundreds of megabytes of swap space to be held in reserve even
though it will never be called upon.

The problem is that while fork() may fail cleanly, the later writes do
not have any such mechanism defined. The kernel is deadlocked until
it makes room somehow. A similar problem arises if a process uses
lots and lots of stack. We'd like to be able to start more
processes than (swap space/10 MB), but also allow any individual
process to allocate a 10 MB stack. The problem only arises if
*all* processes want 10 MB of stack.

So we want to permit "optimistic" overbooking of memory, but limited
so that it will never be a problem in practice.

The first thing I think is worth doing is keeping a couple of
"promises outstanding" counters. One is for things that are expected
to be called in. brk(), for example, is the back end to malloc(), and
malloc()ed data is usually called upon in short order. A reasonable
amount of stack space per process also belongs in this class.

The second is for outstanding promises that are *not* expected to
be called in. Things like everything over 64K in stack space,
COW pages resulting from fork(), and so on.

If anything would take allocated memory plus the first class of
promised memory past the system memory limit, the allocation should fail.
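
In code, that bookkeeping might look something like this sketch (all
names invented; everything is counted in pages, and total_pages is
RAM+swap):

/* Class 1: promises expected to be called in (brk(), "reasonable" stack). */
static unsigned long total_pages;       /* RAM + swap */
static unsigned long allocated_pages;   /* pages actually backed right now */
static unsigned long expected_promises; /* class 1 */

/* Called on brk(), "reasonable" stack growth, etc.; returning -1 lets
 * the caller fail the system call cleanly. */
int promise_expected(unsigned long pages)
{
        if (allocated_pages + expected_promises + pages > total_pages)
                return -1;
        expected_promises += pages;
        return 0;
}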

But rather than imposing any sort of limit on the second class, I think
that just reserving some space to make good on unexpected calls is
correct. This is sort of like the memory reserved for atomic
allocations in the kernel, except that it comes from RAM+Swap.

Like the kernel atomic allocation reserve, we want "failable"
allocations to start failing before all of memory is exhausted so
there will always be space for "non-failable" allocations.

That pool can be fixed in size, a proportion of memory+swap, or,
the way banks' cash reserve requirements are set, a percentage of
outstanding-but-not-expected promises.

In any case, any failable allocation that would take the system into
the "danger zone" is not permitted. (You could define a tighter
danger zone for root if desired.)
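
Continuing the sketch above with more invented names, the reserve and
the danger-zone test might look like:

static unsigned long unexpected_promises; /* class 2: COW pages, huge stacks */

/* The reserve kept for making good on class-2 promises; shown as a
 * percentage of those promises, but a fixed size or a fraction of
 * total_pages would work just as well. */
static unsigned long reserve_pages(void)
{
        return unexpected_promises / 10;        /* 10%, tunable */
}

/* True once committed memory has eaten into the reserve. */
static int in_danger_zone(unsigned long extra_pages)
{
        return allocated_pages + expected_promises + extra_pages
                + reserve_pages() > total_pages;
}

/* A failable allocation (one whose caller can cleanly be told "no",
 * like fork() or brk()) is refused at the danger-zone boundary; a
 * non-failable one (backing a COW write) goes ahead, eats into the
 * reserve, and only fails once memory is truly gone. */
int allocate_pages(unsigned long pages, int failable)
{
        if (failable && in_danger_zone(pages))
                return -1;              /* e.g. fork() gets ENOMEM */
        if (allocated_pages + pages > total_pages)
                return -1;              /* reserve gone: time to pick a victim */
        allocated_pages += pages;
        return 0;
}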

This should prevent OOM problems for non-pathological situations, with
a tunable definition of pathological. But if something is
pathological, the reserved memory+swap can be used up, so the kernel
needs to "default" on a promise and kill a process.

The trick is to choose a "badly behaved" process so that if someone is
trying to attack the system they'll be the one to suffer.

Well, by definition, a "badly behaved" process is one that pushes the
kernel into the danger zone by calling in unexpected promises and making
lots of non-failable allocations. And we're worried about what's
happening that's bad *now*, so the history before we entered the
danger zone is not relevant. A process that claimed a bit more
stack than usual but settled down and let the system get out
of the danger zone is not a bad citizen.

So this provides a simple metric of badness which can be used to
measure who should get killed: of the "danger zone" memory allocated,
who has used the most? Most applications quickly "stabilize" and stop
allocating any, so if this has happened before the danger zone was most
recently entered, their badness is zero and they will not be killed.
But *somebody* is driving the system into the danger zone, and that
process or processes will have a non-zero badness.

If it's a single process, it'll stand out. If it's an intelligent
attack, an attacker can arrange to pre-fork enough attack clients
(since once in the danger zone, further forks will fail) that
each need only allocate one page, but the system can still find
the culprits.

Long-running daemons will almost invariably have a badness of 0
and not be killed, because they were part of the system's pre-attack
stable state.
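
A sketch of the selection, again with invented names; badness here is
just a per-process count of pages taken since the system most
recently entered the danger zone, reset to zero whenever it gets back
out:

struct task {
        struct task *next;
        int pid;
        unsigned long badness;  /* pages charged since entering the zone */
};

/* Pick the process to default on: whoever has taken the most memory
 * since the danger zone was most recently entered.  Stable daemons
 * that stopped allocating long ago have badness 0 and are skipped. */
struct task *pick_victim(struct task *tasks)
{
        struct task *t, *worst = 0;

        for (t = tasks; t; t = t->next)
                if (t->badness > 0 && (!worst || t->badness > worst->badness))
                        worst = t;
        return worst;   /* NULL only if nobody has nonzero badness */
}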

Note that it would be a good idea to *also* have SIGDANGER,
ignored by default, but one for which a process can register a handler.
SIGDANGER is broadcast when the danger zone is entered, and
nice applications will prune their memory pools. (The
allocation which failed might sleep and automatically retry,
as well - that's an option.)
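
For what it's worth, the application side would be tiny. A sketch,
assuming such a signal were actually defined (it isn't today, hence
the stand-in definition):

#include <signal.h>

#ifndef SIGDANGER
#define SIGDANGER SIGUSR1       /* stand-in; no such signal exists yet */
#endif

static volatile sig_atomic_t memory_tight;

static void danger_handler(int sig)
{
        (void)sig;
        memory_tight = 1;       /* main loop prunes its pools when set */
}

int main(void)
{
        signal(SIGDANGER, danger_handler);

        /* ... normal application main loop; whenever memory_tight is
         * set, free cached data and clear the flag ... */
        return 0;
}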

Anyway, this is all handwaving and speculation.
Does anyone care to express an opinion on this?

-- 
	-Colin
