Re: [RFC PATCH v9 2/2] sched: Fix performance regression introduced by mm_cid

From: Aaron Lu
Date: Thu Apr 20 2023 - 10:22:13 EST


On Thu, Apr 20, 2023 at 09:54:29AM -0400, Mathieu Desnoyers wrote:
> On 2023-04-20 09:35, Aaron Lu wrote:
> [...]
> > > > >
> > > > > Then we clearly have another member of mm_struct on the same cache line as
> > > > > pcpu_cid which is bouncing all over the place and causing false-sharing. Any
> > > > > idea which field(s) are causing this ?
> > > >
> > > > That's my first reaction too but as I said in an earlier reply:
> > > > https://lore.kernel.org/lkml/20230419080606.GA4247@ziqianlu-desk2/
> > > > I've tried to place pcpu_cid into a dedicate cacheline with no other
> > > > fields sharing a cacheline with it in mm_struct but it didn't help...
> > >
> > > I see two possible culprits there:
> > >
> > > 1) The mm_struct pcpu_cid field is suffering from false-sharing. I would be
> > > interested to look at your attempt to move it to a separate cache line to
> > > try to figure out what is going on.
> >
> > Brain damaged...my mistake, I only made sure its following fields not
> > share the same cacheline but forgot to exclude its preceding fields and
> > turned out it's one(some?) of the preceeding fields that caused false
> > sharing. When I did:
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 5eab61156f0e..a6f9d815991c 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -606,6 +606,7 @@ struct mm_struct {
> > */
> > atomic_t mm_count;
> > #ifdef CONFIG_SCHED_MM_CID
> > + CACHELINE_PADDING(_pad1_);
> > /**
> > * @pcpu_cid: Per-cpu current cid.
> > *
> > mm_cid_get() dropped to 0.0x% when running hackbench :-)
>
> Now we are talking! :)
>
> >
> > sched_mm_cid_migrate_to() is about 4% with most cycles spent on
> > accessing mm->mm_users:
> >
> > │ dst_cid = READ_ONCE(dst_pcpu_cid->cid);
> > 0.03 │ mov 0x8(%r12),%r15d
> > │ if (!mm_cid_is_unset(dst_cid) &&
> > 0.07 │ cmp $0xffffffff,%r15d
> > │ ↓ je 87
> > │ arch_atomic_read():
> > │ {
> > │ /*
> > │ * Note for KASAN: we deliberately don't use READ_ONCE_NOCHECK() here,
> > │ * it's non-inlined function that increases binary size and stack usage.
> > │ */
> > │ return __READ_ONCE((v)->counter);
> > 76.13 │ mov 0x54(%r13),%eax
> > │ sched_mm_cid_migrate_to():
> > │ cmp %eax,0x410(%rdx)
> > 21.71 │ ↓ jle 1d8
> > │ atomic_read(&mm->mm_users) >= t->nr_cpus_allowed)
> >
> > With this info, it should be mm_users that caused false sharing for
> > pcpu_cid previously. Looks like mm_users is bouncing.
>
> I suspect that the culprit here is mm_count rather than mm_users. mm_users
> just happens to share the same cache line as mm_count.
>
> mm_count is incremented/decremented with mmgrab()/mmdrop() during
> context switch.
>
> This is likely causing other issues, for instance, the
> membarrier_state field is AFAIR read-mostly, used for
> membarrier_mm_sync_core_before_usermode() to issue core
> sync before every return to usermode if needed.
>
> Other things like mm_struct pgd pointer appear to be likely
> read-mostly variables.
>
> I suspect it's mm_count which should be moved to its own cache line
> to eliminate false-sharing with all the other read-mostly fields
> of mm_struct.
>
> Thoughts ?

Makes sesne, I was wondering where the write side of mm_user is. Let me
see how that goes by placing mm_count aside from other read mostly fields.