Re: 4.6.2 frequent crashes under memory + IO pressure

From: Johannes Stezenbach
Date: Sat Jun 25 2016 - 11:50:32 EST


On Thu, Jun 23, 2016 at 08:26:35PM +0900, Tetsuo Handa wrote:
>
> Since you think you saw OOM messages with the older kernels, I assume that the OOM
> killer was invoked on your 4.6.2 kernel. The OOM reaper in Linux 4.6 and Linux 4.7
> will not help if the OOM-killed process was between down_write(&mm->mmap_sem) and
> up_write(&mm->mmap_sem).
>
> I was not able to confirm whether the OOM-killed process (I guess it was java)
> was holding mm->mmap_sem for write, because /proc/sys/kernel/hung_task_warnings
> dropped to 0 before the traces of the java threads were printed, or the console
> became unusable due to the "delayed: kcryptd_crypt, ..." line. Anyway, I think
> that kmallocwd will report it.
>
> > > It is sad that we haven't merged kmallocwd which will report
> > > which memory allocations are stalling
> > > ( http://lkml.kernel.org/r/1462630604-23410-1-git-send-email-penguin-kernel@xxxxxxxxxxxxxxxxxxx ).
> >
> > Would you like me to try it? It wouldn't prevent the hang, though,
> > just print better debug output to the serial console, right?
> > Or would it OOM kill some process?
>
> Yes, but for bisection purposes, please try commit 78ebc2f7146156f4 without
> applying kmallocwd. If that commit helps avoid the flood of allocation
> failure warnings, we can consider backporting it. If that commit does not
> help, I think you are reporting a new location where we should not use
> memory reserves.
>
> kmallocwd will not OOM-kill any process, and it will not prevent the hang.
> kmallocwd just prints information about threads that are stalling inside
> memory allocation requests.

First I tried today's git, linux-4.7-rc4-187-g086e3eb, and
the good news is that the OOM killer seems to work very
well and reliably killed the offending task (java).
This happened a few times; the AOSP build broke and I restarted
it until it completed. E.g.:

[ 2083.604374] Purging GPU memory, 0 pages freed, 4508 pages still pinned.
[ 2083.611000] 96 and 0 pages still available in the bound and unbound GPU page lists.
[ 2083.618815] make invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[ 2083.629257] make cpuset=/ mems_allowed=0
...
[ 2084.688753] Out of memory: Kill process 10431 (java) score 378 or sacrifice child
[ 2084.696593] Killed process 10431 (java) total-vm:5200964kB, anon-rss:2521764kB, file-rss:0kB, shmem-rss:0kB
[ 2084.938058] oom_reaper: reaped process 10431 (java), now anon-rss:0kB, file-rss:8kB, shmem-rss:0kB

Next I tried 4.6.2 with 78ebc2f7146156f4, then with kmallocwd (which needed
one manual fixup), then with both patches. It still livelocked in all cases;
the log spew looked a bit different with 78ebc2f7146156f4 applied but still
continued endlessly. kmallocwd alone didn't trigger; with both patches
applied it triggered, but:

[ 363.815595] MemAlloc-Info: stalling=33 dying=0 exiting=42 victim=0 oom_count=0
[ 363.815601] MemAlloc: kworker/0:0(4) flags=0x4208860 switches=212 seq=1 gfp=0x26012c0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_NOTRACK) order=0 delay=17984
** 1402 printk messages dropped ** [ 363.818816] [<ffffffff8116d519>] __do_page_cache_readahead+0x144/0x29d
** 501 printk messages dropped **
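
Most of the kmallocwd traces were lost in the dropped printk messages; if it
helps, I can rerun with a larger log buffer (e.g. log_buf_len=16M on the
kernel command line) and try to capture them next time.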

I'll zip up the logs and send them off-list.


Thanks,
Johannes