Re: [RFC PATCH v4] sched: Fix performance regression introduced by mm_cid

From: Aaron Lu
Date: Thu Apr 13 2023 - 10:10:04 EST


On Thu, Apr 13, 2023 at 01:10:47PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 12, 2023 at 10:39:34PM +0800, Aaron Lu wrote:
> > On Wed, Apr 12, 2023 at 04:26:16PM +0200, Peter Zijlstra wrote:
> > > On Wed, Apr 12, 2023 at 07:42:40PM +0800, Aaron Lu wrote:
> > >
> > > > I *guess* you might be able to see some contention with hackbench on
> > > > that HSW-EX system with v4.
> > >
> > > Indeed! Notably it seems to be the wakeup from idle that trips it
> > > hardest:
> >
> > Could it because for idle cpus, the per-cpu/mm cid is no longer valid for
> > the sake of compacting and when task wakes there, it will have to
> > re-allocate a new cid through mm_get_cid() which needs to acquire
> > mm->cid_lock?
>
> Yup. And I'm thinking it is futile (and counter productive) to strive
> for compactness in this (nr_threads >= nr_cpus) case.
>
> The below on v4 solves the contention I see with hackbench (which runs
> 400 threads which is well above the 144 cpu count on that machine).

I also tested on Icelake and SPR using hackbench and the contention is
also gone :-)

I then tested with sysbench_postgres, the contention is still there on
SPR:

11.63% 11.62% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
7.91% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify
3.17% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64;epoll_wait

>
> This obviously leaves a problem with the nr_threads = nr_cpus - 1 case,
> but I'm thinking we can add some fuzz (nr_cpu_ids - ilog2(nr_cpus_ids+1)
> perhaps). Also, I would be thinking that's not something that typically
> happens.

Let me see if adding some fuzz helps here.

For sysbench_postgres, the server uses process model while the client
uses thread model and the client(sysbench) started nr_cpus threads so
it's likely didn't trigger the (nr_cpus-1) bar here. I doubt the bar
may have to be as low as 1/2 nr_cpus considering that some of the
cpus are running the server processes.

>
> Mathieu, WDYT? -- other than that the patch is an obvious hack :-)
>
> ---
> include/linux/mm_types.h | 8 ++++++++
> kernel/fork.c | 4 +++-
> kernel/sched/core.c | 9 +++++++++
> kernel/sched/sched.h | 2 ++
> 4 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4160ff5c6ebd..598d1b657afa 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -609,6 +609,7 @@ struct mm_struct {
> * were being concurrently updated by the updaters.
> */
> raw_spinlock_t cid_lock;
> + int cid_saturated;
> /**
> * @pcpu_cid: Per-cpu current cid.
> *
> @@ -912,6 +913,12 @@ static inline int mm_cid_clear_lazy_put(int cid)
> return cid & ~MM_CID_LAZY_PUT;
> }
>
> +static inline void mm_cid_desaturate(struct mm_struct *mm)
> +{
> + if (mm->cid_saturated && atomic_read(&mm->mm_users) < nr_cpu_ids)
> + mm->cid_saturated = 0;
> +}
> +
> /* Accessor for struct mm_struct's cidmask. */
> static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
> {
> @@ -928,6 +935,7 @@ static inline void mm_init_cid(struct mm_struct *mm)
> int i;
>
> raw_spin_lock_init(&mm->cid_lock);
> + mm->cid_saturated = 0;
> for_each_possible_cpu(i)
> *per_cpu_ptr(mm->pcpu_cid, i) = MM_CID_UNSET;
> cpumask_clear(mm_cidmask(mm));
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3832bea713c4..a5233e450435 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1233,7 +1233,9 @@ void mmput(struct mm_struct *mm)
> might_sleep();
>
> if (atomic_dec_and_test(&mm->mm_users))
> - __mmput(mm);
> + return __mmput(mm);
> +
> + mm_cid_desaturate(mm);
> }
> EXPORT_SYMBOL_GPL(mmput);
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 425766cc1300..d5004d179531 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11550,6 +11550,15 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
> if (last_mm_cid == -1)
> return;
>
> + /*
> + * When nr_threads > nr_cpus, there is no point in moving anything
> + * around to keep it compact.
> + */
> + if (mm->cid_saturated) {
> + t->last_mm_cid = -1;
> + return;
> + }
> +
> src_rq = task_rq(t);
> src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
> src_cid = READ_ONCE(*src_pcpu_cid);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f3e7dc2cd1cc..6c4af2992e79 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3347,6 +3347,8 @@ static inline int mm_cid_get(struct mm_struct *mm)
> }
> raw_spin_lock(&mm->cid_lock);
> cid = __mm_cid_get_locked(mm);
> + if (cid == nr_cpu_ids - 1)
> + mm->cid_saturated = 1;
> raw_spin_unlock(&mm->cid_lock);
> WRITE_ONCE(*pcpu_cid, cid);
> return cid;