Re: [PATCH 0/3] OOM detection rework v4

From: Tetsuo Handa
Date: Fri Mar 11 2016 - 11:49:41 EST


Michal Hocko wrote:
> On Fri 11-03-16 22:32:02, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 11-03-16 19:45:29, Tetsuo Handa wrote:
> > > > (Posting as a reply to this thread.)
> > >
> > > I really do not see how this is related to this thread.
> >
> > All allocating tasks are looping at
> >
> > /*
> > * If we didn't make any progress and have a lot of
> > * dirty + writeback pages then we should wait for
> > * an IO to complete to slow down the reclaim and
> > * prevent from pre mature OOM
> > */
> > if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> > congestion_wait(BLK_RW_ASYNC, HZ/10);
> > return true;
> > }
> >
> > in should_reclaim_retry().
> >
> > should_reclaim_retry() was added by the OOM detection rework, wasn't it?
>
> What happens without this patch applied? In other words, it all smells
> like the IO got stuck somewhere and the direct reclaim cannot perform it,
> so we have to wait for the flushers to make progress for us. Are those
> stuck? Is the IO making any progress at all, or is it just too slow and
> would it actually finish eventually? Wouldn't we just wait somewhere
> else in the direct reclaim path instead?

As of next-20160311, CPU usage drops to 0% when this problem occurs.
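
For reference, should_reclaim_retry() is called from the retry loop in
__alloc_pages_slowpath(). A simplified sketch (the call signatures are
approximated from the rework series, not quoted verbatim):

retry:
	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags,
					    ac, &did_some_progress);
	if (page)
		goto got_pg;

	/*
	 * When the dirty/writeback throttle quoted above triggers,
	 * should_reclaim_retry() sleeps in congestion_wait() and returns
	 * true, so every allocating task keeps cycling through this loop
	 * while sleeping most of the time, which matches the 0% CPU usage.
	 */
	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
				 did_some_progress > 0, no_progress_loops))
		goto retry;

	/* only when should_reclaim_retry() gives up is the OOM killer tried */
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);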

If I remove

mm-use-watermak-checks-for-__gfp_repeat-high-order-allocations-checkpatch-fixes
mm: use watermark checks for __GFP_REPEAT high order allocations
mm: throttle on IO only when there are too many dirty and writeback pages
mm-oom-rework-oom-detection-checkpatch-fixes
mm, oom: rework oom detection

then CPU usage rises to 60% and most of the allocating tasks
are looping at

	/*
	 * Acquire the oom lock.  If that fails, somebody else is
	 * making progress for us.
	 */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

in __alloc_pages_may_oom() (i.e. an OOM livelock, because the OOM reaper
is disabled).
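
To illustrate the livelock (a hypothetical, condensed view of the control
flow, not actual kernel code): presumably the OOM victim cannot exit
(e.g. it is itself blocked waiting for memory), and with the OOM reaper
disabled nothing frees its memory on its behalf, so every other
allocating task spins like this:

	for (;;) {
		if (mutex_trylock(&oom_lock)) {
			out_of_memory(&oc);	/* selects the same stuck victim */
			mutex_unlock(&oom_lock);
		} else {
			/* pretend progress was made and retry after 1 jiffy */
			schedule_timeout_uninterruptible(1);
		}
		/* the allocation still fails, so we go around again,
		   which accounts for the 60% CPU usage */
	}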