Re: [RFC PATCH v3] sched: Fix performance regression introduced by mm_cid

From: Peter Zijlstra
Date: Tue Apr 11 2023 - 04:47:05 EST


On Fri, Apr 07, 2023 at 07:50:42PM -0400, Mathieu Desnoyers wrote:

> Let's looks at the relevant combinations of TSA/TSB, and TMA transitions.
>
> Scenario A) (TSA)+(TMA) (from next task perspective)
>
> CPU0 CPU1
>
> Context switch CS-1 Migrate-from
> - store to rq->curr: (N)->(Y) (TSA) - cmpxchg to *pcpu_id to LAZY (TMA)
> *** missing barrier ?? *** (implied barrier after cmpxchg)
> - prepare_task_switch()
> - switch_mm_cid()
> - mm_cid_get (next)
> - READ_ONCE(*pcpu_cid) - rcu_dereference(src_rq->curr)
>
> This Dekker ensures that either task (Y) is observed by the rcu_dereference() or the LAZY
> flag is observed by READ_ONCE(), or both are observed.
>
> If task (Y) store is observed by rcu_dereference(), it means that there is still
> an active task on the cpu. Migrate-from will therefore not transition to UNSET, which
> fulfills property (1). That observed task will itself eventually need a migrate-from
> to be migrated away from that cpu, which fulfills property (2).
>
> If task (Y) is not observed, but the lazy flag is observed by READ_ONCE(), it will
> move its state to UNSET, which clears the percpu cid perhaps uselessly (which is not
> an issue for correctness). Because task (Y) is not observed, CPU1 can move ahead to
> set the state to UNSET. Because moving state to UNSET is done with a cmpxchg expecting
> that the old state has the LAZY flag set, only one thread will successfully UNSET.
>
> If both states (LAZY flag and task (Y)) are observed, the thread on CPU0 will observe
> the LAZY flag and transition to UNSET (perhaps uselessly), and CPU1 will observe task
> (Y) and do nothing more, which is fine.
>
> What we are effectively preventing with this Dekker is a scenario where neither LAZY
> flag nor store (Y) are observed, which would fail property (1) because this would
> UNSET a cid which is actively used.

OK, this I'll buy. Let me go stare at this more.

> Scenario B) (TSB)+(TMA) (from prev task perspective)
>
> CPU0 CPU1
>
> Context switch CS-1 Migrate-from
> - store to rq->curr: (Y)->(N) (TSB) - cmpxchg to *pcpu_id to LAZY (TMA)
> *** missing barrier ?? *** (implied barrier after cmpxchg)
> - prepare_task_switch()
> - switch_mm_cid()
> - cid_put_lazy() (prev)
> - READ_ONCE(*pcpu_cid) - rcu_dereference(src_rq->curr)
>

This I'm conflicted about, if we're running Y, then how the heck do we
get to setting LAZY in the first place?

For this scenario there must be at least an N->Y->N transition, such
that the first:

if (src_task->mm_cid_active && src_task->mm == mm) {

can observe N and proceed to setting LAZY. But that then leads us to the
scenario above.

(And apparently I ended up doing an N->N transition, which really isn't
*that* interesting :-)