Re: sched: hang in migrate_swap

From: Rafael David Tinoco
Date: Mon Jun 15 2015 - 15:38:33 EST


Peter, Sasha, coming back to this…

Not that this happens frequently, or that I can easily reproduce it, but…

> On May 14, 2014, at 07:26 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, May 14, 2014 at 02:21:04PM +0400, Kirill Tkhai wrote:
>>
>>
>> 14.05.2014, 14:14, "Peter Zijlstra" <peterz@xxxxxxxxxxxxx>:
>>> On Wed, May 14, 2014 at 01:42:32PM +0400, Kirill Tkhai wrote:
>>>
>>>> Peter, do we have to queue stop works orderly?
>>>>
>>>> Is there is not a possibility, when two pair of works queued different on
>>>> different cpus?
>>>>
>>>> kernel/stop_machine.c | 10 ++++++++--
>>>> 1 file changed, 8 insertions(+), 2 deletions(-)
>>>> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
>>>> index b6b67ec..29e221b 100644
>>>> --- a/kernel/stop_machine.c
>>>> +++ b/kernel/stop_machine.c
>>>> @@ -250,8 +250,14 @@ struct irq_cpu_stop_queue_work_info {
>>>> static void irq_cpu_stop_queue_work(void *arg)
>>>> {
>>>> struct irq_cpu_stop_queue_work_info *info = arg;
>>>> - cpu_stop_queue_work(info->cpu1, info->work1);
>>>> - cpu_stop_queue_work(info->cpu2, info->work2);
>>>> +
>>>> + if (info->cpu1 < info->cpu2) {
>>>> + cpu_stop_queue_work(info->cpu1, info->work1);
>>>> + cpu_stop_queue_work(info->cpu2, info->work2);
>>>> + } else {
>>>> + cpu_stop_queue_work(info->cpu2, info->work2);
>>>> + cpu_stop_queue_work(info->cpu1, info->work1);
>>>> + }
>>>> }
>>>
>>> I'm not sure, we already send the IPI to the first cpu of the pair, so
>>> supposing we have 4 cpus, and get 4 pairs like:
>>>
>>> 0,1 1,2 2,3 3,0
>>>
>>> That would result in IPIs to 0, 1, 2, and 0 again, and since the IPI
>>> function is serialized I don't immediately see a way for this to
>>> deadlock.
>>
>> It's about stop_two_cpus(), I have a distrust about other users of stop task:
>>
>> queue_stop_cpus_work() queues work consequentially:
>>
>> 0 1 2 4
>>
>> stop_two_cpus() may queue:
>>
>> 1 0
>>
>> Looks like, stop thread on 0th and on 1th are waiting for wrong works.
>
> so we serialize stop_cpus_work() vs stop_two_cpus() with an l/g lock.
>
> Ah, but stop_cpus_work() only holds the global lock over queueing, it
> doesn't wait for completion, that might indeed cause a problem.
>
> Also, since its two different cpus queueing, the ordered queue doesn't
> really matter, you can still interleave the all and two sets and get
> into this state.

Do you think __stop_cpus() -> queue_stop_cpus_work() and stop_two_cpus() might be stepping on each other because this global lock is held only while queueing (and not until completion)?

In the past I described to Sasha the following scenario from one of my 3.13 kernels:

> -> multi_cpu_stop -> do { } while (curstate != MULTI_STOP_EXIT);
>
> In my case, curstate is WAY different from enum containing MULTI_STOP_EXIT (4).
>
> Register totally messed up (probably after cpu_relax(), right where
> you were trapped -> after the pause instruction).
>
> my case:
>
> PID: 118 TASK: ffff883fd28ec7d0 CPU: 9 COMMAND: "migration/9"
> ...
> [exception RIP: multi_cpu_stop+0x64]
> RIP: ffffffff810f5944 RSP: ffff883fd2907d98 RFLAGS: 00000246
> RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000246
> RDX: ffff883fd2907d98 RSI: 0000000000000000 RDI: 0000000000000001
> RBP: ffffffff810f5944 R8: ffffffff810f5944 R9: 0000000000000000
> R10: ffff883fd2907d98 R11: 0000000000000246 R12: ffffffffffffffff
> R13: ffff883f55d01b48 R14: 0000000000000000 R15: 0000000000000001
> ORIG_RAX: 0000000000000001 CS: 0010 SS: 0000
> --- <NMI exception stack> ---
> #4 [ffff883fd2907d98] multi_cpu_stop+0x64 at ffffffff810f5944
>
> 208 } while (curstate != MULTI_STOP_EXIT);
> ---> RIP
> RIP 0xffffffff810f5944 <+100>: cmp $0x4,%edx
> ---> CHECKING FOR MULTI_STOP_EXIT
>
> RDX: ffff883fd2907d98 -> does not make any sense
>
> ###
>
> If i'm reading this right,
>
> """
> CPU 05 - PID 14990
>
> do_numa_page
> task_numa_fault
> numa_migrate_preferred
> task_numa_migrate
> migrate_swap (curr: 14990, task: 14996)
> stop_two_cpus (cpu1=05(14996), cpu2=00(14990))
> wait_for_completion
>
> 14990 - CPU05
> 14996 - CPU00
>
> stop_two_cpus:
> multi_stop_data (msdata->state = MULTI_STOP_PREPARE)
> smp_call_function_single (min=cpu2=00, irq_cpu_stop_queue_work, wait=1)
> smp_call_function_single (ran on lowest CPU, 00 for this case)
> irq_cpu_stop_queue_work
> cpu_stop_queue_work(cpu1=05(14996)) # add work (multi_cpu_stop) to cpu 05 cpu_stopper queue
> cpu_stop_queue_work(cpu2=00(14990)) # add work (multi_cpu_stop) to cpu 00 cpu_stopper queue
> wait_for_completion() --> HERE
> """
>
> in my case, checking task structs for tasks scheduled when
> "waiting_for_completion()":
>
> PID 14990 CPU 05 -> PID 14996 CPU 00
> PID 14991 CPU 30 -> PID 14998 CPU 01
> PID 14992 CPU 30 -> PID 14998 CPU 01
> PID 14996 CPU 00 -> PID 14992 CPU 30
> PID 14998 CPU 01 -> PID 14990 CPU 05
>
> AND
>
> 102 2 6 ffff881fd2ea97f0 RU 0.0 0 0 [migration/6]
> 118 2 9 ffff883fd28ec7d0 RU 0.0 0 0 [migration/9]
> 143 2 14 ffff883fd29d47d0 RU 0.0 0 0 [migration/14]
> 148 2 15 ffff883fd29fc7d0 RU 0.0 0 0 [migration/15]
> 153 2 16 ffff881fd2f517f0 RU 0.0 0 0 [migration/16]
>
> THEN
>
> I am still waiting for 5 cpu_stopper_thread -> multi_cpu_stop just
> scheduled (probably in the per cpu's queue of cpus 0,1,5,30), not
> running yet.
>
> AND
>
> I don't have any "wait_for_completion" for those "OLDER" migration
> threads (6, 9, 14, 15 and 16). Probably wait_for_completion signalled
> done.completion before racing.

And, following this thread’s discussion and the commits below:

commit a1d9a3231eac4117cadaf4b6bba5b2902c15a33e
Author: Kirill Tkhai <tkhai@xxxxxxxxx>
Date: Thu Apr 10 17:38:36 2014 +0400

sched: Check for stop task appearance when balancing happens

commit 37e117c07b89194aae7062bc63bde1104c03db02
Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date: Fri Feb 14 12:25:08 2014 +0100

sched: Guarantee task priority in pick_next_task()

commit 38033c37faab850ed5d33bb675c4de6c66be84d8
Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date: Thu Jan 23 20:32:21 2014 +0100

sched: Push down pre_schedule() and idle_balance()

commit 606dba2e289446600a0b68422ed2019af5355c12
Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date: Sat Feb 11 06:05:00 2012 +0100

sched: Push put_prev_task() into pick_next_task()

The 3.13 kernel still had the old logic (pre-3.15): no RETRY_TASK, idle_balance() called before pick_next_task(), and no deadline scheduler yet. So commit “a1d9a32” does not play a role in this panic. I’m generating ~150 stop_two_cpus() calls/sec, for task migration, in a 32-node fake NUMA environment, and I am NOT able to reproduce this lockup. Still, the dump says it is there :\. On the 3.13 series this lockup was seen once; I have no info on other versions.

Any thoughts?

Thank you

-Rafael Tinoco

