Re: [sched] 36932a9e2b1: +121.2% boot-meminfo.SUnreclaim

From: Peter Zijlstra
Date: Wed Aug 06 2014 - 04:22:20 EST


On Wed, Aug 06, 2014 at 05:29:48AM +0800, Fengguang Wu wrote:
> Hi Peter,
>
> FYI, we noticed the below changes on
>
> git://internal_merge_and_test_tree devel-athens-alpha-201408042000
> commit 36932a9e2b1043506d06414fa988cd5b9592841f ("sched: Fix finish_task_switch vs prev_state")
>
> test case: vm-vp-1G/boot/1
>
> 11b4855f1d9f1cc 36932a9e2b1043506d06414fa
> --------------- -------------------------
> 8524 ± 1% +121.2% 18858 ± 0% TOTAL boot-meminfo.SUnreclaim
> 1321 ± 9% +3196.6% 43568 ± 0% TOTAL boot-meminfo.KernelStack
> 8756 ± 0% +29.8% 11363 ± 0% TOTAL boot-slabinfo.num_pages
> 34982 ± 0% +29.5% 45311 ± 0% TOTAL boot-meminfo.Slab
> 535601 ± 0% -9.9% 482375 ± 0% TOTAL boot-meminfo.MemFree
> 116621 ± 0% +9.4% 127548 ± 0% TOTAL boot-slabinfo.num_objs


*blink*.. how does that happen?

For reference, that's the below patch (I've since given it an actual
Changelog, but the patch is the same).

---
Subject: sched: Fix finish_task_switch vs prev_state
From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date: Tue, 29 Jul 2014 11:22:37 +0200

Oleg wondered about the prev_state comment in finish_task_switch().

Aside from it being confusingly worded -- we neither initially
understood the actual problem being described -- we found that we'd
broken the scenario described.

Ever since commit e4a52bcb9a18 ("sched: Remove rq->lock from the first
half of ttwu()") we don't actually acquire rq->lock on wakeups anymore
and therefore holding rq->lock after the switch is no good.

Even if we did, __ARCH_WANT_UNLOCKED_CTXSW was already broken, because
it explicitly didn't hold the rq->lock anymore.

We could fix things by placing a full barrier between the prev->state
read and the next->on_cpu = 0 store, seeing how the remote wakeup code
waits for ->on_cpu to become false before proceeding with the wakeup
(so as to avoid having the task running on two cpus at the same time),
however full barriers are expensive.

Instead we read prev->state before the context switch and propagate it
unto finish_task_switch. This trivially solves the problem without
adding extra (and costly) memory barriers.

I'm not sure why we've never seen crashes due to this, it appears
entirely possible.

Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: John Stultz <john.stultz@xxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
Cc: Dave Jones <davej@xxxxxxxxxx>
Cc: Andrey Ryabinin <a.ryabinin@xxxxxxxxxxx>
Cc: Sasha Levin <sasha.levin@xxxxxxxxxx>
Cc: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Reported-by: Oleg Nesterov <oleg@xxxxxxxxxx>
Signed-off-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Link: http://lkml.kernel.org/r/20140729092237.GU12054@xxxxxxxxxx
---
kernel/sched/core.c | 36 ++++++++++++++++++++----------------
1 file changed, 20 insertions(+), 16 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2190,6 +2190,7 @@ prepare_task_switch(struct rq *rq, struc
* finish_task_switch - clean up after a task-switch
* @rq: runqueue associated with task-switch
* @prev: the thread we just switched away from.
+ * @prev_state: the state of @prev before we switched away from it.
*
* finish_task_switch must be called after the context switch, paired
* with a prepare_task_switch call before the context switch.
@@ -2201,26 +2202,14 @@ prepare_task_switch(struct rq *rq, struc
* with the lock held can cause deadlocks; see schedule() for
* details.)
*/
-static void finish_task_switch(struct rq *rq, struct task_struct *prev)
+static void
+finish_task_switch(struct rq *rq, struct task_struct *prev, long prev_state)
__releases(rq->lock)
{
struct mm_struct *mm = rq->prev_mm;
- long prev_state;

rq->prev_mm = NULL;

- /*
- * A task struct has one reference for the use as "current".
- * If a task dies, then it sets TASK_DEAD in tsk->state and calls
- * schedule one last time. The schedule call will never return, and
- * the scheduled task must drop that reference.
- * The test for TASK_DEAD must occur while the runqueue locks are
- * still held, otherwise prev could be scheduled on another cpu, die
- * there before we look at prev->state, and then the reference would
- * be dropped twice.
- * Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
- */
- prev_state = prev->state;
vtime_task_switch(prev);
finish_arch_switch(prev);
perf_event_task_sched_in(prev, current);
@@ -2279,7 +2268,7 @@ asmlinkage __visible void schedule_tail(
{
struct rq *rq = this_rq();

- finish_task_switch(rq, prev);
+ finish_task_switch(rq, prev, 0);

/*
* FIXME: do we need to worry about rq being invalidated by the
@@ -2304,6 +2293,21 @@ context_switch(struct rq *rq, struct tas
struct task_struct *next)
{
struct mm_struct *mm, *oldmm;
+ /*
+ * A task struct has one reference for the use as "current".
+ * If a task dies, then it sets TASK_DEAD in tsk->state and calls
+ * schedule one last time. The schedule call will never return, and
+ * the scheduled task must drop that reference.
+ *
+ * We must observe prev->state before clearing prev->on_cpu (in
+ * finish_lock_switch), otherwise a concurrent wakeup can get prev
+ * running on another CPU and we could race with its RUNNING -> DEAD
+ * transition, and then the reference would be dropped twice.
+ *
+ * We avoid the race by observing prev->state while it is still
+ * current.
+ */
+ long prev_state = prev->state;

prepare_task_switch(rq, prev, next);

@@ -2347,7 +2351,7 @@ context_switch(struct rq *rq, struct tas
* CPUs since it called schedule(), thus the 'rq' on its stack
* frame will be invalid.
*/
- finish_task_switch(this_rq(), prev);
+ finish_task_switch(this_rq(), prev, prev_state);
}

/*

Attachment: pgp9gRtWr3XAL.pgp
Description: PGP signature