Re: [RFC PATCH v3] sched: Fix performance regression introduced by mm_cid

From: Peter Zijlstra
Date: Tue Apr 11 2023 - 07:04:53 EST


On Fri, Apr 07, 2023 at 09:14:36PM -0400, Mathieu Desnoyers wrote:

> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 2a243616f222..f20fc0600fcc 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -37,6 +37,11 @@ static inline void mmgrab(struct mm_struct *mm)
>  	atomic_inc(&mm->mm_count);
>  }
>  
> +static inline void smp_mb__after_mmgrab(void)
> +{
> +	smp_mb__after_atomic();
> +}
> +
>  extern void __mmdrop(struct mm_struct *mm);
>  
>  static inline void mmdrop(struct mm_struct *mm)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9e0fa4193499..8d410c0dcb39 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5117,7 +5117,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  	sched_info_switch(rq, prev, next);
>  	perf_event_task_sched_out(prev, next);
>  	rseq_preempt(prev);
> -	switch_mm_cid(prev, next);
>  	fire_sched_out_preempt_notifiers(prev, next);
>  	kmap_local_sched_out();
>  	prepare_task(next);
> @@ -5273,6 +5272,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  	 *
>  	 * kernel ->   user   switch + mmdrop() active
>  	 *   user ->   user   switch
> +	 *
> +	 * switch_mm_cid() needs to be updated if the barriers provided
> +	 * by context_switch() are modified.
>  	 */
>  	if (!next->mm) {                                // to kernel
>  		enter_lazy_tlb(prev->active_mm, next);
> @@ -5302,6 +5304,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  		}
>  	}
>  
> +	/* switch_mm_cid() requires the memory barriers above. */
> +	switch_mm_cid(prev, next);
> +
>  	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
>  
>  	prepare_lock_switch(rq, next, rf);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index bc0e1cd0d6ac..f3e7dc2cd1cc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3354,6 +3354,37 @@ static inline int mm_cid_get(struct mm_struct *mm)
>  static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
>  {
> +	/*
> +	 * Provide a memory barrier between rq->curr store and load of
> +	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
> +	 *
> +	 * Should be adapted if context_switch() is modified.
> +	 */
> +	if (!next->mm) {				// to kernel
> +		/*
> +		 * user -> kernel transition does not guarantee a barrier, but
> +		 * we can use the fact that it performs an atomic operation in
> +		 * mmgrab().
> +		 */
> +		if (prev->mm)				// from user
> +			smp_mb__after_mmgrab();
> +		/*
> +		 * kernel -> kernel transition does not change rq->curr->mm
> +		 * state. It stays NULL.
> +		 */
> +	} else {					// to user
> +		/*
> +		 * kernel -> user transition does not provide a barrier
> +		 * between rq->curr store and load of
> +		 * {prev,next}->mm->pcpu_cid[cpu]. Provide it here.
> +		 */
> +		if (!prev->mm)				// from kernel
> +			smp_mb();
> +		/*
> +		 * user -> user transition guarantees a memory barrier through
> +		 * switch_mm().
> +		 */
> +	}
>  	if (prev->mm_cid_active) {
>  		mm_cid_put_lazy(prev);
>  		prev->mm_cid = -1;
>

This is going to be a pain wrt.:

https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@xxxxxxxxx

which is already in -next. Also, I reckon Nick isn't going to be too
happy -- although I reckon smp_mb() is better than an atomic op on
Power. But still.

Urgh...

For Nick; the TL;DR is we need an smp_mb() after setting rq->curr and
before calling switch_mm_cid() *IFF* rq->curr->mm changes. Normally this
is provided by switch_mm() itself per actually changing the address
space, except for the whole active_mm/lazy swizzle nonsense, which gives
a few holes still.

The very much longer explanation is upthread here:

https://lkml.kernel.org/r/fdaa7242-4ddd-fbe2-bc0e-6c62054dbde8@xxxxxxxxxxxx