[PATCH] mm,oom: fix oom invocation issues

From: Roman Gushchin
Date: Wed May 17 2017 - 11:26:59 EST


During the debugging of some OOM-related stuff, I've noticed
that sometimes OOM kills two processes instead of one.

The problem can be easily reproduced on a vanilla kernel
(allocate is a simple process which calls malloc() and faults
each page in a infinite loop):
[ 25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[ 25.725658] allocate cpuset=/ mems_allowed=0
[ 25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
<cut>
[ 25.768293] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 25.768860] [ 121] 0 121 25672 133 50 3 0 0 systemd-journal
[ 25.769530] [ 156] 0 156 11157 197 22 3 0 -1000 systemd-udevd
[ 25.770206] [ 206] 0 206 13896 99 29 3 0 -1000 auditd
[ 25.770822] [ 227] 0 227 11874 124 27 3 0 0 systemd-logind
[ 25.771494] [ 229] 81 229 11577 146 28 3 0 -900 dbus-daemon
[ 25.772126] [ 231] 997 231 27502 102 25 3 0 0 chronyd
[ 25.772731] [ 233] 0 233 61519 5239 85 3 0 0 firewalld
[ 25.773345] [ 238] 0 238 123495 529 74 4 0 0 NetworkManager
[ 25.773988] [ 265] 0 265 25117 231 52 3 0 -1000 sshd
[ 25.774569] [ 271] 0 271 6092 154 17 3 0 0 crond
[ 25.775137] [ 277] 0 277 11297 93 26 3 0 0 systemd-hostnam
[ 25.775766] [ 284] 0 284 1716 29 9 3 0 0 agetty
[ 25.776342] [ 285] 0 285 2030 34 9 4 0 0 agetty
[ 25.776919] [ 302] 998 302 133102 2578 58 3 0 0 polkitd
[ 25.777505] [ 394] 0 394 21785 3076 45 3 0 0 dhclient
[ 25.778092] [ 444] 0 444 36717 312 74 3 0 0 sshd
[ 25.778744] [ 446] 0 446 15966 223 36 3 0 0 systemd
[ 25.779304] [ 447] 0 447 23459 384 47 3 0 0 (sd-pam)
[ 25.779877] [ 451] 0 451 36717 316 72 3 0 0 sshd
[ 25.780450] [ 452] 0 452 3611 315 11 3 0 0 bash
[ 25.781107] [ 492] 0 492 513092 473645 934 5 0 0 allocate
[ 25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
[ 25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
<cut>
[ 25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null), order=0, oom_score_adj=0
[ 25.818821] allocate cpuset=/ mems_allowed=0
[ 25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[ 25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 25.820549] Call Trace:
[ 25.820733] dump_stack+0x63/0x82
[ 25.820961] dump_header+0x97/0x21a
[ 25.820961] ? security_capable_noaudit+0x45/0x60
[ 25.820961] oom_kill_process+0x219/0x3e0
[ 25.820961] out_of_memory+0x11d/0x480
[ 25.820961] pagefault_out_of_memory+0x68/0x80
[ 25.820961] mm_fault_error+0x8f/0x190
[ 25.820961] ? handle_mm_fault+0xf3/0x210
[ 25.820961] __do_page_fault+0x4b2/0x4e0
[ 25.820961] trace_do_page_fault+0x37/0xe0
[ 25.820961] do_async_page_fault+0x19/0x70
[ 25.820961] async_page_fault+0x28/0x30
<cut>
[ 25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
[ 25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB

After some investigations I've found some issues:

1) Prior to commit 1af8bb432695 ("mm, oom: fortify task_will_free_mem()"),
if a process with a pending SIGKILL was calling out_of_memory(),
it was always immediately selected as a victim.
But now, after some changes, it's not always a case.
If a process has been reaped at the moment, MMF_SKIP_FLAG is set,
task_will_free_mem() will return false, and a new
victim selection logic will be started.

This actually happens if a userspace pagefault causing an OOM.
pagefault_out_of_memory() is called in a context of a faulting
process after it has been selected as OOM victim (assuming, it
has), and killed. With some probability (there is a race with
oom_reaper thread) this process will be passed to the oom reaper
again, or an innocent victim will be selected and killed.

2) We clear up the task->oom_reaper_list before setting
the MMF_OOM_SKIP flag, so there is a race.

3) We skip the MMF_OOM_SKIP flag check in case of
an sysrq-triggered OOM.

To address these issues, the following is proposed:
1) If task is already an oom victim, skip out_of_memory() call
from the pagefault_out_of_memory().

2) Set the MMF_OOM_SKIP bit in wake_oom_reaper() before adding a
process to the oom_reaper list. If it's already set, do nothing.
Do not rely on tsk->oom_reaper_list value.

3) Check the MMF_OOM_SKIP even if OOM is triggered by a sysrq.

Signed-off-by: Roman Gushchin <guro@xxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
Cc: kernel-team@xxxxxx
Cc: linux-mm@xxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
---
mm/oom_kill.c | 33 +++++++++++++--------------------
1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..c630c76 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -302,10 +302,11 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
* the task has MMF_OOM_SKIP because chances that it would release
* any memory is quite low.
*/
- if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
+ if (tsk_is_oom_victim(task)) {
if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
goto next;
- goto abort;
+ if (!is_sysrq_oom(oc))
+ goto abort;
}

/*
@@ -559,22 +560,11 @@ static void oom_reap_task(struct task_struct *tsk)
while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
schedule_timeout_idle(HZ/10);

- if (attempts <= MAX_OOM_REAP_RETRIES)
- goto done;
-
-
- pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
- task_pid_nr(tsk), tsk->comm);
- debug_show_all_locks();
-
-done:
- tsk->oom_reaper_list = NULL;
-
- /*
- * Hide this mm from OOM killer because it has been either reaped or
- * somebody can't call up_write(mmap_sem).
- */
- set_bit(MMF_OOM_SKIP, &mm->flags);
+ if (attempts > MAX_OOM_REAP_RETRIES) {
+ pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+ task_pid_nr(tsk), tsk->comm);
+ debug_show_all_locks();
+ }

/* Drop a reference taken by wake_oom_reaper */
put_task_struct(tsk);
@@ -590,6 +580,7 @@ static int oom_reaper(void *unused)
if (oom_reaper_list != NULL) {
tsk = oom_reaper_list;
oom_reaper_list = tsk->oom_reaper_list;
+ tsk->oom_reaper_list = NULL;
}
spin_unlock(&oom_reaper_lock);

@@ -605,8 +596,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
if (!oom_reaper_th)
return;

- /* tsk is already queued? */
- if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+ if (test_and_set_bit(MMF_OOM_SKIP, &tsk->signal->oom_mm->flags))
return;

get_task_struct(tsk);
@@ -1068,6 +1058,9 @@ void pagefault_out_of_memory(void)
if (mem_cgroup_oom_synchronize(true))
return;

+ if (tsk_is_oom_victim(current))
+ return;
+
if (!mutex_trylock(&oom_lock))
return;
out_of_memory(&oc);
--
2.7.4