Re: bug in memcg oom-killer results in a hung syscall in another process in the same cgroup

From: Konstantin Khlebnikov
Date: Tue Jul 12 2016 - 11:59:56 EST


On 12.07.2016 18:35, Shayan Pooya wrote:
With strace, when running 500 concurrent mem-hog tasks on the same
kernel, 33 of them failed with:

strace: ../sysdeps/nptl/fork.c:136: __libc_fork: Assertion
`THREAD_GETMEM (self, tid) != ppid' failed.

Which is: https://sourceware.org/bugzilla/show_bug.cgi?id=15392
And discussed before at: https://lkml.org/lkml/2015/2/6/470 but that
patch was not accepted.

OK, so the problem is that the oom killed task doesn't report the futex
release properly? If yes then I fail to see how that is memcg specific.
Could you try to clarify what you consider a bug again, please? I am not
really sure I understand this report.

It looks like it is just a very easy way to reproduce the problem that
Konstantin described in that lkml thread. That patch was not accepted
and I see no other fixes for that issue upstream. Here is a copy of
his root-cause analysis from said thread:

Whole sequence looks like: task calls fork, glibc calls syscall clone with
CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument.
Child task gets read-only copy of VM including TLS. Child calls put_user()
to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page
fault and it fails because do_wp_page() hits memcg limit without invoking
OOM-killer because this is page-fault from kernel-space. Put_user returns
-EFAULT, which is ignored. Child returns into user-space and catches here
assert (THREAD_GETMEM (self, tid) != ppid), glibc tries to print something
but hangs on deadlock on internal locks. Halt and catch fire.



Yep. Bug still not fixed in upstream. In our kernel I've plugged it with this:

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2808,8 +2808,9 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
balance_callback(rq);
preempt_enable();

- if (current->set_child_tid)
- put_user(task_pid_vnr(current), current->set_child_tid);
+ if (current->set_child_tid &&
+ put_user(task_pid_vnr(current), current->set_child_tid))
+ force_sig(SIGSEGV, current);
}

Add Oleg into CC. IIRR he had some ideas how to fix this. =)

--
Konstantin