Re: [RFC PATCH] sched/core: Fix premature p->migration_pending completion

From: Qais Yousef
Date: Thu Feb 04 2021 - 10:31:57 EST


On 02/03/21 18:59, Valentin Schneider wrote:
> On 03/02/21 17:23, Qais Yousef wrote:
> > On 01/27/21 19:30, Valentin Schneider wrote:
> >> Fiddling some more with a TLA+ model of set_cpus_allowed_ptr() & friends
> >> unearthed one more outstanding issue. This doesn't even involve
> >> migrate_disable(), but rather affinity changes and execution of the stopper
> >> racing with each other.
> >>
> >> My own interpretation of the (lengthy) TLA+ splat (note the potential for
> >> errors at each level) is:
> >>
> >> Initial conditions:
> >> victim.cpus_mask = {CPU0, CPU1}
> >>
> >>   CPU0                                CPU1                                CPU<don't care>
> >>
> >>   switch_to(victim)
> >>                                                                           set_cpus_allowed(victim, {CPU1})
> >>                                                                             kick CPU0 migration_cpu_stop({.dest_cpu = CPU1})
> >>   switch_to(stopper/0)
> >>                                       // e.g. CFS load balance
> >>                                       move_queued_task(CPU0, victim, CPU1);
> >>                                       switch_to(victim)
> >>                                                                           set_cpus_allowed(victim, {CPU0});
> >>                                                                             task_rq_unlock();
> >>   migration_cpu_stop(dest_cpu=CPU1)
> >
> > This migration stop is due to set_cpus_allowed(victim, {CPU1}), right?
> >
>
> Right
>
> >>     task_rq(p) != rq && pending
> >>       kick CPU1 migration_cpu_stop({.dest_cpu = CPU1})
> >>
> >>                                       switch_to(stopper/1)
> >>                                       migration_cpu_stop(dest_cpu=CPU1)
> >
> > And this migration stop is due to set_cpus_allowed(victim, {CPU0}), right?
> >
>
> Nein! This is a retriggering of the "current" stopper (triggered by
> set_cpus_allowed(victim, {CPU1})); see the tail of that
>
> else if (dest_cpu < 0 || pending)
>
> branch in migration_cpu_stop(); that's what I'm trying to hint at with that
>
> task_rq(p) != rq && pending

Okay, I see. But AFAIU the stopper work is queued in order, so shouldn't we
handle the set_cpus_allowed_ptr(victim, {CPU0}) stop before the retrigger?

So I see migration_cpu_stop() running 3 times:

1. on CPU0, because of set_cpus_allowed(victim, {CPU1})
2. on CPU1, because of set_cpus_allowed(victim, {CPU0})
3. on CPU1, as the retrigger of '1' (kicked from CPU0)

Thanks

--
Qais Yousef