Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL

From: Dave Hansen
Date: Mon May 11 2009 - 15:46:03 EST


On Mon, 2009-05-11 at 12:09 -0700, David Rientjes wrote:
> On Mon, 11 May 2009, Dave Hansen wrote:
> > Could you explain a little more about why you think this scenario works
> > for you? Are large contiguous areas of memory pinned by the task
> > getting which you want to get killed? Why wasn't swapping effective
> > against this task? Was the task itself taking up a large portion of
> > total memory?
>
> We frequently do cpuset-constrained oom kills where the lionshare of
> memory on a set of nodes is allocated by a single task or a group of
> threads all sharing the same memory. Swapping is largely effective but at
> this point in the code path it's obviously not making any progress in
> freeing pages.

So, there are three conditions that have to be met before we get to this
behavior:
1. order > PAGE_ALLOC_COSTLY_ORDER
2. __GFP_NOFAIL set
3. no progress being made

The two hunks of this patch really do different things to me.
Personally, I'd split them up.

This patch's first hunk effectively says, "When we have __GFP_NOFAIL
set, it is now OK to OOM tasks for 'order > PAGE_ALLOC_COSTLY_ORDER'".
That's a little bit goofy to me. Either OOMs help large-order allocs or
they don't. __GFP_NOFAIL doesn't change how effective an OOM is.

> So this change fixes two issues:
>
> - __GFP_NOFAIL allocations should not be allowed to return NULL, and
>
> - we should prevent looping endlessly in the page allocator if reclaim
> cannot free the requisite amount of memory.
>
> There is no reason that the oom killer would not be able to kill a task
> that could free 64K of contiguous memory, especially for those that
> mlock() their memory. You could argue that any __GFP_NOFAIL allocation
> above order 3 is insane and should not kill tasks, but that's an issue
> higher up the stack. If you'd like to identify such instances, we could
> emit a warning message here and a stack trace.

Yeah, adding a new warning or enhancing the existing "page allocation
failure" warning would be nice.

Aren't you a bit worried that people will keep adding new 'goto nopage'
cases in here? Would it be more effective to just put a:

if (gfp_mask & __GFP_NOFAIL)
goto restart;

at the very end of the function? That should keep people from screwing
this up in the future a bit better.

Thanks for going to the trouble of trying to sort all this code out a
bit better. I know it is a mess. Could you try and beef up the
descriptions a bit in the next pass? Some of the stuff about where
you're encountering these situations would be really helpful, especially
with 'git blame' these days.

-- Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/