Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory

From: Tetsuo Handa
Date: Wed Apr 06 2016 - 06:29:53 EST


This ext4 livelock case shows a race window that commit 36324a990cf5
("oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space")
did not take into account.

----------
[ 186.620979] Out of memory: Kill process 4458 (file_io.24) score 997 or sacrifice child
[ 186.627897] Killed process 4458 (file_io.24) total-vm:4336kB, anon-rss:116kB, file-rss:1024kB, shmem-rss:0kB
[ 186.688345] oom_reaper: reaped process 4458 (file_io.24), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

[ 245.572082] MemAlloc: file_io.24(4715) flags=0x400040 switches=8650 uninterruptible dying victim
[ 245.578876] file_io.24 D 0000000000000000 0 4715 1 0x00100084
[ 245.584122] ffff88002fd9c000 ffff88002fda4000 ffff880036221870 00000000000035a2
[ 245.589618] 0000000000000000 ffff880036221870 0000000000000000 ffffffff81587dec
[ 245.595428] ffff880036221800 ffffffff8123b821 0000000000000000 ffff88002fd9c000
[ 245.601370] Call Trace:
[ 245.603428] [<ffffffff81587dec>] ? schedule+0x2c/0x80
[ 245.607680] [<ffffffff8123b821>] ? wait_transaction_locked+0x81/0xc0 /* linux-4.6-rc2/fs/jbd2/transaction.c:163 */
[ 245.613586] [<ffffffff810a1ee0>] ? wait_woken+0x80/0x80 /* linux-4.6-rc2/kernel/sched/wait.c:292 */
[ 245.618074] [<ffffffff8123ba9a>] ? add_transaction_credits+0x21a/0x2a0 /* linux-4.6-rc2/fs/jbd2/transaction.c:191 */
[ 245.623497] [<ffffffff81178abc>] ? mem_cgroup_commit_charge+0x7c/0xf0
[ 245.628352] [<ffffffff8123bceb>] ? start_this_handle+0x18b/0x400 /* linux-4.6-rc2/fs/jbd2/transaction.c:357 */
[ 245.632755] [<ffffffff8110fb6e>] ? add_to_page_cache_lru+0x6e/0xd0
[ 245.637274] [<ffffffff8123c294>] ? jbd2__journal_start+0xf4/0x190 /* linux-4.6-rc2/fs/jbd2/transaction.c:459 */
[ 245.642298] [<ffffffff81205ca4>] ? ext4_da_write_begin+0x114/0x360 /* linux-4.6-rc2/fs/ext4/inode.c:2883 */
[ 245.647035] [<ffffffff8111116e>] ? generic_perform_write+0xce/0x1d0 /* linux-4.6-rc2/mm/filemap.c:2639 */
[ 245.651651] [<ffffffff8119c440>] ? file_update_time+0xc0/0x110
[ 245.656166] [<ffffffff81111f2d>] ? __generic_file_write_iter+0x16d/0x1c0 /* linux-4.6-rc2/mm/filemap.c:2765 */
[ 245.660835] [<ffffffff811fbafa>] ? ext4_file_write_iter+0x12a/0x340 /* linux-4.6-rc2/fs/ext4/file.c:170 */
[ 245.665292] [<ffffffff810226ad>] ? __switch_to+0x20d/0x3f0
[ 245.669604] [<ffffffff81182ddb>] ? __vfs_write+0xcb/0x100
[ 245.673904] [<ffffffff81183968>] ? vfs_write+0x98/0x190
[ 245.678174] [<ffffffff81184d2d>] ? SyS_write+0x4d/0xc0
[ 245.682376] [<ffffffff810034a7>] ? do_syscall_64+0x57/0xf0
[ 245.686845] [<ffffffff8158b1e1>] ? entry_SYSCALL64_slow_path+0x25/0x25
----------

ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) {
  ret = __generic_file_write_iter(iocb, from) {
    written = generic_perform_write(file, from, iocb->ki_pos) {
      if (fatal_signal_pending(current)) {
        status = -EINTR;
        break;
      }
      status = a_ops->write_begin(file, mapping, pos, bytes, flags, &page, &fsdata) /* ext4_da_write_begin */ { /***** Event1 *****/
        handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, ext4_da_write_credits(inode, pos, len)) /* __ext4_journal_start */ {
          __ext4_journal_start_sb(inode->i_sb, line, type, blocks, rsv_blocks) {
            jbd2__journal_start(journal, blocks, rsv_blocks, GFP_NOFS, type, line) {
              err = start_this_handle(journal, handle, gfp_mask) {
                if (!journal->j_running_transaction) {
                  /*
                   * If __GFP_FS is not present, then we may be being called from
                   * inside the fs writeback layer, so we MUST NOT fail.
                   */
                  if ((gfp_mask & __GFP_FS) == 0)
                    gfp_mask |= __GFP_NOFAIL;
                  new_transaction = kmem_cache_zalloc(transaction_cache, gfp_mask); /***** Event2 *****/
                  if (!new_transaction)
                    return -ENOMEM;
                }
                /* We may have dropped j_state_lock - restart in that case */
                add_transaction_credits(journal, blocks, rsv_blocks) {
                  /*
                   * If the current transaction is locked down for commit, wait
                   * for the lock to be released.
                   */
                  if (t->t_state == T_LOCKED) { /***** Event3 *****/
                    wait_transaction_locked(journal); /***** Event4 *****/
                    return 1;
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Event1 ... The OOM killer sent SIGKILL to file_io.24(4715) because
file_io.24(4715) was sharing memory with file_io.24(4458).
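
For reference, the relevant part of oom_kill_process() in 4.6-rc2 looks
roughly like this (paraphrased); note that sharers of the victim's mm get
SIGKILL but not mark_oom_victim(), so file_io.24(4715) had a pending
SIGKILL without TIF_MEMDIE at this point:

	/* mm/oom_kill.c: oom_kill_process(), paraphrased from 4.6-rc2 */
	rcu_read_lock();
	for_each_process(p) {
		if (!process_shares_mm(p, mm))
			continue;
		if (same_thread_group(p, victim))
			continue;
		if (unlikely(p->flags & PF_KTHREAD))
			continue;
		if (is_global_init(p))
			continue;
		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;
		/* SIGKILL is sent, but TIF_MEMDIE is not set on these tasks */
		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
	}
	rcu_read_unlock();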

Event2 ... file_io.24(4715) silently got TIF_MEMDIE via the
           fatal_signal_pending(current) shortcut in out_of_memory(), because
           __GFP_NOFAIL allows kmem_cache_zalloc() to call out_of_memory().
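
The shortcut in question is this block in out_of_memory() (paraphrased from
4.6-rc2): a task that already has a pending SIGKILL marks itself as the OOM
victim, which sets TIF_MEMDIE, instead of going through victim selection:

	/* mm/oom_kill.c: out_of_memory(), paraphrased from 4.6-rc2 */
	/*
	 * If current has a pending SIGKILL or is exiting, then automatically
	 * select it.  The goal is to allow it to allocate so that it may
	 * quickly exit and free its memory.
	 */
	if (current->mm &&
	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
		mark_oom_victim(current);	/* sets TIF_MEMDIE on current */
		return true;
	}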

Event3 ... By this point, the OOM reaper had finished reaping the memory used
           by file_io.24(4458) and had marked file_io.24(4458) as no longer
           OOM-killable. But since the OOM reaper cleared TIF_MEMDIE only from
           file_io.24(4458), TIF_MEMDIE in file_io.24(4715) remained set.
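
Commit 36324a990cf5 made __oom_reap_task() end with roughly the following
(paraphrased); both operations act only on the reaped task tsk
(file_io.24(4458) here), not on the other tasks sharing the same mm:

	/* mm/oom_kill.c: __oom_reap_task(), paraphrased from 4.6-rc2 */
	/*
	 * The reaped task can no longer release its memory, so it is
	 * not a good OOM victim candidate any more.
	 */
	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
	exit_oom_victim(tsk);	/* clears TIF_MEMDIE from tsk only */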

Event4 ... file_io.24(4715) (allocating with GFP_NOFS | __GFP_NOFAIL) is
           waiting for kworker/u128:1(51) (allocating with GFP_NOFS) to
           complete wb_workfn. But neither kworker/u128:1(51) nor
           kworker/0:2(285) (allocating with GFP_NOIO) can make forward
           progress, because the OOM reaper does not clear TIF_MEMDIE from
           file_io.24(4715), and the OOM killer does not select the next OOM
           victim while TIF_MEMDIE remains set in file_io.24(4715).
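
The OOM killer backs off because of the existing-victim check in
oom_scan_process_thread() (paraphrased from 4.6-rc2); as long as any task
still counts as an OOM victim, the whole scan is aborted and no new victim
is selected:

	/* mm/oom_kill.c: oom_scan_process_thread(), paraphrased from 4.6-rc2 */
	/*
	 * This task already has access to memory reserves and is being killed.
	 * Don't allow any other task to have access to the reserves.
	 */
	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
		return OOM_SCAN_ABORT;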

If we removed these shortcuts, set TIF_MEMDIE on all OOM-killed threads
sharing the victim's memory at oom_kill_process(), and cleared TIF_MEMDIE from
all threads sharing the victim's memory at __oom_reap_task() (or did the
equivalent using a per-signal_struct flag, a per-mm_struct flag, or a timer),
we would not have hit this race window. Thus, I say again, "I think that
removing these shortcuts is better," unless we add a guaranteed unlocking
mechanism such as a timer.
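
As an untested sketch of what I mean (on top of 4.6-rc2): the kill loop in
oom_kill_process() would also mark every killed sharer, and __oom_reap_task()
would clear all of them again (exit_oom_victim() is a no-op for tasks whose
TIF_MEMDIE was never set):

	/* oom_kill_process(): inside the loop over processes sharing mm */
		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
		mark_oom_victim(p);	/* proposed: all sharers get TIF_MEMDIE */

	/* __oom_reap_task(): instead of exit_oom_victim(tsk) alone */
	rcu_read_lock();
	for_each_process(p) {
		if (!process_shares_mm(p, mm))
			continue;
		p->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
		exit_oom_victim(p);	/* proposed: unblock the OOM killer */
	}
	rcu_read_unlock();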

Also, I want to repeat that letting the current thread's current allocation
request succeed by giving it TIF_MEMDIE does not guarantee that the thread
will reach do_exit() shortly. Even if the current allocation succeeds, the
thread may then block in an unkillable wait.
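
wait_transaction_locked() above is exactly such an unkillable wait: it sleeps
in TASK_UNINTERRUPTIBLE, so neither the pending SIGKILL nor TIF_MEMDIE can
wake the task (paraphrased from linux-4.6-rc2/fs/jbd2/transaction.c):

	static void wait_transaction_locked(journal_t *journal)
		__releases(journal->j_state_lock)
	{
		DEFINE_WAIT(wait);
		int need_to_start;
		tid_t tid = journal->j_running_transaction->t_tid;

		/* TASK_UNINTERRUPTIBLE: fatal signals are ignored here */
		prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
				TASK_UNINTERRUPTIBLE);
		need_to_start = !tid_geq(journal->j_commit_request, tid);
		read_unlock(&journal->j_state_lock);
		if (need_to_start)
			jbd2_log_start_commit(journal, tid);
		schedule();
		finish_wait(&journal->j_wait_transaction_locked, &wait);
	}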

Also, is it acceptable that allocation requests by kworker/u128:1(51) and
kworker/0:2(285) fail because they are !__GFP_FS && !__GFP_NOFAIL, when
file_io.24(4715) has managed to allocate memory for the journal's transaction
using GFP_NOFS | __GFP_NOFAIL?
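
For context, the check this RFC patch moves into out_of_memory() is roughly
the following (paraphrased); it makes the OOM killer bail out without killing
anything for !__GFP_FS && !__GFP_NOFAIL requests, so such requests cannot get
an OOM kill on their own behalf:

	/* mm/oom_kill.c: out_of_memory(), as proposed by this RFC (paraphrased) */
	/*
	 * The OOM killer does not compensate for IO-less reclaim.
	 * Skip killing for !__GFP_FS requests unless they may not fail.
	 */
	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS | __GFP_NOFAIL)))
		return true;	/* no OOM kill for this request */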