Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

From: Aaron Lu
Date: Wed Jun 28 2023 - 04:00:38 EST

In reply to: Deng, Pan: "RE: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing"
Next in thread: Aaron Lu: "Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing"

On Tue, Jun 27, 2023 at 12:14:37PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 26, 2023 at 01:47:56PM +0800, Aaron Lu wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ec7b3e0a2b20..31b73e8d9568 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -385,7 +385,9 @@ struct task_group {
> >  	 * it in its own cacheline separated from the fields above which
> >  	 * will also be accessed at each tick.
> >  	 */
> > -	atomic_long_t		load_avg ____cacheline_aligned;
> > +	struct {
> > +		atomic_long_t		load_avg;
> > +	} ____cacheline_aligned_in_smp;
> >  #endif
> >  #endif
> >
> > This way it can make sure there is no false sharing with load_avg no
> > matter how the layout of this structure changes in the future.
>
> This. Also, ISTR there was a series to split this atomic across nodes;

Yes.

> whatever happened to that,

After collecting more data, the test summary is:
- on SPR, hackbench time reduced ~8% and netperf(UDP_RR/nr_thread=100%)
performance increased ~50%;
- on Icelake, performance regressed about 1%-2% for postgres_sysbench
and hackbench, netperf has no performance change;
- on Cascade Lake, netperf/UDP_RR/nr_thread=50% sees performance
drop ~3%; others have no performance change.

So it is a win for SPR and a small regression for Icelake and Cascade
Lake. Daniel tested on AMD machines and also saw some minor
regressions. The win on SPR is most likely because the per-node
tg->load_avg patch turns all CPUs contending on one cacheline into CPUs
contending on two cachelines. That helps SPR because, when many CPUs
contend on the same cacheline, SPR is likely to enter the "Ingress
Queue Overflow" state, which is bad for performance. I also did some
experiments on a 2-socket SPR where I placed the two per-node
tg->load_avg counters on the same node, and I saw a similar performance
improvement.

Based on the test results and the reason why SPR sees improvement, I
didn't continue to push it.

Another thing: after making tg->load_avg per node, update_tg_load_avg()
is strictly local, which is good. But update_cfs_group() needs to read
the counters of all nodes, so the per-node tg->load_avg cachelines still
bounce across nodes, and update_cfs_group() is called very frequently.
I suppose that is where those small regressions come from, but the
solution is not obvious.
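
To make the write/read asymmetry concrete, here is a minimal userspace
sketch. It is not the actual "Make tg->load_avg per node" patch: the
names struct node_load, struct tg_sketch, tg_add_load(), tg_read_load()
and tg_demo() are made up for illustration, and NR_NODES = 2 and 64-byte
cachelines are assumptions. The writer touches only its own node's
counter, while the reader has to pull in every node's cacheline.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_NODES  2	/* assumption: a 2-socket machine */
#define CACHELINE 64	/* assumption: 64-byte cachelines */

/* One counter per node, each padded out to its own cacheline. */
struct node_load {
	_Alignas(CACHELINE) atomic_long load_avg;
};
static_assert(sizeof(struct node_load) == CACHELINE,
	      "each per-node counter should fill exactly one cacheline");

/* Hypothetical stand-in for the relevant part of struct task_group. */
struct tg_sketch {
	struct node_load node[NR_NODES];
};

/* Writer side (think update_tg_load_avg()): local node's line only. */
static void tg_add_load(struct tg_sketch *tg, int node, long delta)
{
	atomic_fetch_add_explicit(&tg->node[node].load_avg, delta,
				  memory_order_relaxed);
}

/* Reader side (think update_cfs_group()): visits every node's line. */
static long tg_read_load(struct tg_sketch *tg)
{
	long sum = 0;

	for (int n = 0; n < NR_NODES; n++)
		sum += atomic_load_explicit(&tg->node[n].load_avg,
					    memory_order_relaxed);
	return sum;
}

/* Tiny demo: two nodes contribute, the reader sees the total. */
static long tg_demo(void)
{
	struct tg_sketch tg;

	for (int n = 0; n < NR_NODES; n++)
		atomic_init(&tg.node[n].load_avg, 0);

	tg_add_load(&tg, 0, 100);
	tg_add_load(&tg, 1, 50);
	tg_add_load(&tg, 0, -30);
	return tg_read_load(&tg);
}
```

In this sketch the remote reads in tg_read_load() are what would keep
the per-node lines moving between sockets, even though every write stays
local.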

> and can we still measure an improvement over this with that approach?

Let me re-run those tests and see how things change.

In my previous tests I didn't turn on CONFIG_RT_GROUP_SCHED. To test
this, I suppose I'll turn CONFIG_RT_GROUP_SCHED on, apply the change
here that puts tg->load_avg in a dedicated cacheline, and then see how
performance changes with the "Make tg->load_avg per node" patch. Will
report back once done.
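
For reference, the difference between aligning only the member and
wrapping it in an aligned struct can be checked with a small userspace
sketch. This is not kernel code: struct tg_a, struct tg_b, the shares
and later_field members and same_cacheline() are hypothetical, a 64-byte
cacheline is assumed, and the aligned attribute is the GCC/Clang
spelling. Aligning the member only fixes where it starts, so a field
added after it can land on the same line. The aligned wrapper also
rounds its size up to a full cacheline.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* assumption: 64-byte cachelines */

/* Variant A: only the member is aligned (like ____cacheline_aligned). */
struct tg_a {
	long shares;
	long load_avg __attribute__((aligned(CACHELINE)));
	long later_field;	/* hypothetical field added in the future */
};

/* Variant B: the member sits in an aligned anonymous struct. */
struct tg_b {
	long shares;
	struct {
		long load_avg;
	} __attribute__((aligned(CACHELINE)));
	long later_field;	/* hypothetical field added in the future */
};

/* Return 1 if two byte offsets fall within the same cacheline. */
static int same_cacheline(size_t a, size_t b)
{
	return a / CACHELINE == b / CACHELINE;
}

/* In A, the later field shares load_avg's cacheline. */
static_assert(offsetof(struct tg_a, later_field) / CACHELINE ==
	      offsetof(struct tg_a, load_avg) / CACHELINE,
	      "variant A: neighbour lands on load_avg's cacheline");

/* In B, the wrapper pads to a full line, pushing the field out. */
static_assert(offsetof(struct tg_b, later_field) / CACHELINE !=
	      offsetof(struct tg_b, load_avg) / CACHELINE,
	      "variant B: load_avg keeps its cacheline to itself");
```

Both variants start load_avg on a cacheline boundary. Only variant B
keeps later additions to the structure off that line.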

Thanks,
Aaron
