Re: OOM killer firing on 2.6.18 and later during LTP runs

From: Andrew Morton
Date: Sun Nov 26 2006 - 02:31:53 EST


On Sun, 26 Nov 2006 02:25:38 -0500
Dave Jones <davej@xxxxxxxxxx> wrote:

> On Sat, Nov 25, 2006 at 11:11:53PM -0800, Andrew Morton wrote:
> > On Sat, 25 Nov 2006 22:00:45 -0500
> > Dave Jones <davej@xxxxxxxxxx> wrote:
> >
> > > On Sat, Nov 25, 2006 at 01:28:28PM -0800, Andrew Morton wrote:
> > > > On Sat, 25 Nov 2006 13:03:45 -0800
> > > > "Martin J. Bligh" <mbligh@xxxxxxxxxx> wrote:
> > > >
> > > > > On 2.6.18-rc7 and later during LTP:
> > > > > http://test.kernel.org/abat/48393/debug/console.log
> > > >
> > > > The traces are a bit confusing, but I don't actually see anything wrong
> > > > there. The machine has used up all swap, has used up all memory and has
> > > > correctly gone and killed things. After that, there's free memory again.
> > >
> > > We covered this a month or two back. For RHEL5, we've ended up
> > > reintroducing the oom killer prevention logic that we had up until
> > > circa 2.6.10. It seemed that there exist circumstances where
> > > given a little more time, some memory hogging apps will run to completion
> > > allowing other allocators to succeed instead of being killed.
> >
> > I _think_ what you're describing here is a false-positive oom-killing? But
> > Martin appears to be hitting a genuine oom.
>
> what we saw during the rhel5 testing was that yes, the machine _was_ OOM
> *temporarily*, but if instead of killing the task trying to allocate, we
> postponed the killing a few times, it would give other tasks the opportunity
> to complete writeout, or free up memory some other way, allowing the
> allocating process to succeed shortly afterwards.

That would be a false positive then.

In Martin's case he's 100% out of swapspace and has only a few tens of
pages letf mapped into pagetables, so Iassume that all memory is unmapped
swapcache (but that cannot be confirmed from the info which we have). But
it looks like a real oom.

That's not to say that we don't have omm-killer problems.

> > But it does appear that some changes are needed, because lots of things got
> > oom-killed.
> >
> > I think. Maybe not - there's no timestamping in those logs and it is of
> > course possible that we're seeing unrelated ooms which happened a long time
> > apart.
>
> Maybe, but it does sound spookily familiar.
> The last time Larry's patch got floated to lkml it was met with
> "Ah!, but we have new oom killer changes in -git which might solve this".
> We tried them. They didn't.

What's the testcase?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/