Re: [RFC PATCH v3] sched: Fix performance regression introduced by mm_cid

From: Mathieu Desnoyers
Date: Tue Apr 11 2023 - 08:38:56 EST


On 2023-04-11 04:46, Peter Zijlstra wrote:
On Fri, Apr 07, 2023 at 07:50:42PM -0400, Mathieu Desnoyers wrote:

Let's looks at the relevant combinations of TSA/TSB, and TMA transitions.

Scenario A) (TSA)+(TMA) (from next task perspective)

CPU0 CPU1

Context switch CS-1 Migrate-from
- store to rq->curr: (N)->(Y) (TSA) - cmpxchg to *pcpu_id to LAZY (TMA)
*** missing barrier ?? *** (implied barrier after cmpxchg)
- prepare_task_switch()
- switch_mm_cid()
- mm_cid_get (next)
- READ_ONCE(*pcpu_cid) - rcu_dereference(src_rq->curr)

This Dekker ensures that either task (Y) is observed by the rcu_dereference() or the LAZY
flag is observed by READ_ONCE(), or both are observed.

If task (Y) store is observed by rcu_dereference(), it means that there is still
an active task on the cpu. Migrate-from will therefore not transition to UNSET, which
fulfills property (1). That observed task will itself eventually need a migrate-from
to be migrated away from that cpu, which fulfills property (2).

If task (Y) is not observed, but the lazy flag is observed by READ_ONCE(), it will
move its state to UNSET, which clears the percpu cid perhaps uselessly (which is not
an issue for correctness). Because task (Y) is not observed, CPU1 can move ahead to
set the state to UNSET. Because moving state to UNSET is done with a cmpxchg expecting
that the old state has the LAZY flag set, only one thread will successfully UNSET.

If both states (LAZY flag and task (Y)) are observed, the thread on CPU0 will observe
the LAZY flag and transition to UNSET (perhaps uselessly), and CPU1 will observe task
(Y) and do nothing more, which is fine.

What we are effectively preventing with this Dekker is a scenario where neither LAZY
flag nor store (Y) are observed, which would fail property (1) because this would
UNSET a cid which is actively used.

OK, this I'll buy. Let me go stare at this more.

Scenario B) (TSB)+(TMA) (from prev task perspective)

CPU0 CPU1

Context switch CS-1 Migrate-from
- store to rq->curr: (Y)->(N) (TSB) - cmpxchg to *pcpu_id to LAZY (TMA)
*** missing barrier ?? *** (implied barrier after cmpxchg)
- prepare_task_switch()
- switch_mm_cid()
- cid_put_lazy() (prev)
- READ_ONCE(*pcpu_cid) - rcu_dereference(src_rq->curr)


This I'm conflicted about, if we're running Y, then how the heck do we
get to setting LAZY in the first place?

For this scenario there must be at least an N->Y->N transition, such
that the first:

if (src_task->mm_cid_active && src_task->mm == mm) {

can observe N and proceed to setting LAZY. But that then leads us to the
scenario above.

Remember that migrate-from does not hold any rq lock. Therefore, it's very much possible that the first check:

if (src_task->mm_cid_active && src_task->mm == mm) {

observes (N), then gets delayed for a while, and then only sets the
LAZY flag when (Y) has been scheduled, leading us to Scenario B).

Thanks,

Mathieu





--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com