Re: process 'stuck' at exit.

From: Mel Gorman
Date: Tue Dec 10 2013 - 20:02:41 EST


On Tue, Dec 10, 2013 at 08:18:29PM +0100, Thomas Gleixner wrote:
> On Tue, 10 Dec 2013, Linus Torvalds wrote:
>
> > Hmm. Looks like the futex code is somehow stuck in a loop, calling
> > get_user_pages_fast().
> >
> > The futex code itself is apparently so low-overhead that it doesn't
> > show up in your 'perf top' report (which is dominated by all the
> > expensive debug things that get_user_pages_fast() etc ends up doing),
> > but that's the only looping I can see. Perhaps the "goto again" case
> > for transparent huge pages in get_futex_key()? Or the
>
> Cc'ng more folks on that.
>

I just saw this before heading to bed and have not read the thread. I'll
read it in the morning, but in the meantime the following might ring a bell
for someone else's investigation or for someone more familiar with how
futexes work from end to end.

Was NUMA balancing enabled and was this a NUMA machine?

I ask because of these two patches that are currently in flight:

  mm: numa: Serialise parallel get_user_page against THP migration
  mm: fix TLB flush race between migration, and change_protection_range

There are related patches but these two are the most important for what
I have in mind. The two in combination address a problem whereby a write
from one thread can be lost due to a THP migration but it's specific to
automatic NUMA balancing. If the lost update was for a page containing a
futex then the lost write could confuse waiters. The downside is that this
is a bad fit for the problem description in the first mail. A lost update
might result in processes waiting forever on a value that never changes
but offhand it's less clear why it might result in a loop. Unless of
course there is a combination of events that allows for a busy wait on a
value that will never change due to the lost write.
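
To make the distinction between "waits forever" and "loops" concrete, here
is a toy userspace model of a futex-style wait loop. It is only a sketch:
model_futex_wait(), model_waiter() and the split between user_view and
kernel_view are hypothetical names invented for illustration, not kernel
code and not a claim about what happened in the report. The one real fact
it relies on is that FUTEX_WAIT fails with EAGAIN when the futex word does
not hold the expected value.

```c
#include <assert.h>
#include <errno.h>

/* Possible outcomes for the modelled waiter (hypothetical names). */
enum waiter_outcome {
    WAITER_DONE,      /* saw the wanted value and returned */
    WAITER_BLOCKED,   /* would sleep; needs a wake that may never come */
    WAITER_SPINNING   /* kept retrying until the iteration budget ran out */
};

/*
 * Model of FUTEX_WAIT semantics only: sleep if the word still holds
 * 'val', otherwise fail with -EAGAIN so the caller re-checks.
 */
static int model_futex_wait(const int *kernel_view, int val)
{
    if (*kernel_view != val)
        return -EAGAIN;
    return 0; /* would sleep until woken */
}

/*
 * Typical wait loop: read the word, stop if it has the wanted value,
 * otherwise wait on the value just read and retry on -EAGAIN.
 * user_view and kernel_view are separate pointers so the model can
 * express the two sides observing different copies of the data; that
 * such a divergence can occur is an assumption of this sketch.
 */
static enum waiter_outcome model_waiter(const int *user_view,
                                        const int *kernel_view,
                                        int wanted, int max_iters)
{
    assert(max_iters > 0);
    for (int i = 0; i < max_iters; i++) {
        int seen = *user_view;

        if (seen == wanted)
            return WAITER_DONE;
        if (model_futex_wait(kernel_view, seen) == 0)
            return WAITER_BLOCKED;
        /* -EAGAIN: value differed from what we read, try again */
    }
    return WAITER_SPINNING;
}
```

With both views pointing at the same word and the waker's store lost, the
model ends in WAITER_BLOCKED, which is the "waiting forever" case. It only
ends in WAITER_SPINNING when the two views disagree and stay that way, so a
lost write alone is not enough in this model to produce a busy loop.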

--
Mel Gorman
SUSE Labs