Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

From: Aaron Lu
Date: Fri Aug 11 2023 - 05:28:30 EST


On Fri, Jul 21, 2023 at 09:57:04AM +0800, Aaron Lu wrote:
> On Thu, Jul 20, 2023 at 05:02:32PM +0200, Vincent Guittot wrote:
> >
> > What was wrong with your proposal to limit the update inside
> > update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and
> > computing delta after testing the time since last update
>
> I was thinking it might be better to align the update_tg_load_avg() with
> cfs_rq's decay point but that's just my feeling.
>
> Absolutely nothing wrong with the below approach, it's straightforward
> and effective. I'll fix the use of cfs_rq_clock_pelt() and collect
> some data and then send out v2.
>
> Thank you Vincent for all your comments, they're very useful to me.
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..e48fd0e6982d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > {
> > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > + u64 now = cfs_rq_clock_pelt(cfs_rq);
> >
> > /*
> > * No need to update load_avg for root_task_group as it is not used.
> > @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > if (cfs_rq->tg == &root_task_group)
> > return;
> >
> > - if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > + if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
> > + abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> > + cfs_rq->last_update_tg_load_avg = now;
> > }
> > }
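
For completeness, the snippet above also assumes a new timestamp field in
struct cfs_rq; a minimal sketch of that addition (the exact placement
inside the CONFIG_FAIR_GROUP_SCHED section of kernel/sched/sched.h is
only illustrative):

	struct cfs_rq {
		...
		unsigned long		tg_load_avg_contrib;
		/* ns timestamp of the last tg->load_avg update from this cfs_rq */
		u64			last_update_tg_load_avg;
		...
	};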

While collecting data for v2, I found that with this "rate limit updates
to tg->load_avg to at most once per ms" patch in place, the other two
optimizations, "skip some update_cfs_group() on en/dequeue_entity()" and
"Make tg->load_avg per node", no longer matter: once the "ratelimit"
patch is applied, adding those two optimizations does not improve
performance any further.

I think this is reasonable. The "ratelimit" patch reduces the number of
updates to tg->load_avg dramatically, from roughly 10 million during a 5s
window down to about 25k, so the overhead of accessing tg->load_avg drops
greatly and further reducing the read side calls or making the counter
per-node no longer makes a measurable difference.

So I think for v2, I'll go with this "ratelimit" patch alone, but let me
know if you think otherwise.
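
To make that concrete, here is a rough sketch of what the rate limited
helper could look like in v2, with your two suggestions folded in
(NSEC_PER_MSEC instead of 1000000, and testing the time since the last
update before computing delta). The clock source below is only a
placeholder since I still have to sort out the replacement for
cfs_rq_clock_pelt():

	static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
	{
		long delta;
		u64 now;

		/*
		 * No need to update load_avg for root_task_group as it is not used.
		 */
		if (cfs_rq->tg == &root_task_group)
			return;

		/*
		 * Rate limit updates to tg->load_avg to at most once per ms to
		 * cut down the cacheline traffic caused by migration heavy
		 * workloads. (clock source here is a placeholder, see above)
		 */
		now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
		if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
			return;

		delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
		if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
			atomic_long_add(delta, &cfs_rq->tg->load_avg);
			cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
			cfs_rq->last_update_tg_load_avg = now;
		}
	}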

The tests I've run are listed below.
1. postgres_sysbench with nr_thread=25%/50%/75%/100% on a 2-socket,
   112-core, 224-thread Intel Sapphire Rapids (SPR) machine;
2. hackbench with group=1/4/8/16, pipe, thread mode on a 2-socket,
   64-core, 128-thread Intel Ice Lake (ICL) machine and the same
   2-socket, 112-core, 224-thread Intel SPR machine;
3. netperf in tcp_stream and udp_rr modes. For each mode, nr_client is
   set to 25%/50%/75%/100% of nr_cpu. Tested on the same ICL and SPR
   machines as in 2).

The test summary is:

postgres_sysbench on SPR
- when nr_thread=25% and 50%, results are within the noise range;
- when nr_thread=75% and 100%, the "ratelimit" patch improves
  transactions by 12.2% and 21.2% respectively.

hackbench:
- when nr_group=1 and 4, results are within the noise range on both ICL
  and SPR;
- when nr_group=8 and 16, the "ratelimit" patch improves throughput by
  6.6% and 22.4% on ICL; on SPR, the improvement is 12.5% and 48.8%.

netperf:
- in tcp_stream mode, all results are within the noise range on both ICL
  and SPR;
- in udp_rr mode, when nr_client=25% and 50% of nr_cpu, results are
  within the noise range; when nr_client=75% and 100% of nr_cpu, the
  "ratelimit" patch improves throughput by 38.5% and 13.4% on ICL; on
  SPR, the improvement is 162% and 189%.

Full test results are listed below.

base: peterz's sched/core branch with commit e8d5ff8258bf7 ("sched/fair:
Block nohz tick_stop when cfs bandwidth in use") as head.
rate: the patch that rate limits updates to tg->load_avg to at most once
per ms; applied on top of "base".
skip: the patch "sched/fair: skip some update_cfs_group() on
en/dequeue_entity()" from this v1 series; applied on top of "rate".
node: the patch "sched/fair: Make tg->load_avg per node" from this v1
series plus peterz's numa optimizations; applied on top of "skip".

All improvement percentages are relative to "base".

postgres_sysbench on SPR:

25% (all in noise range)
base: 42382±19.8%
rate: 50174±9.5%
skip: 46311±12.0%
node: 48690±13.9%

50% (all in noise range)
base: 67626±1.3%
rate: 67365±3.1%
skip: 68218±4.9%
node: 66878±3.3%

75%
base: 100216±1.2%
rate: 112470±0.1% +12.2%
skip: 113345±0.3% +13.1%
node: 112962±0.1% +12.7%

100%
base: 93671±0.4%
rate: 113563±0.2% +21.2%
skip: 113741±0.2% +21.4%
node: 112793±0.2% +20.4%

hackbench on ICL:
group=1 (all in noise range)
base: 114912±5.2%
rate: 117857±2.5%
skip: 118501±4.3%
node: 119240±7.0%

group=4 (all in noise range)
base: 359902±1.6%
rate: 361685±2.7%
skip: 356354±1.6%
node: 358744±1.7%

group=8
base: 461070±0.8%
rate: 491713±0.3% +6.6%
skip: 490183±1.4% +6.3%
node: 490864±0.5% +6.5%

group=16
base: 309032±5.0%
rate: 378337±1.3% +22.4%
skip: 385637±1.4% +24.8%
node: 375087±0.9% +21.4%

hackbench on SPR:
group=1 (all in noise range)
base: 100768±2.9%
rate: 103134±2.9%
skip: 99642±1.7%
node: 104251±2.5%

group=4 (all in noise range)
base: 413830±12.5%
rate: 378660±16.6%
skip: 367234±13.1%
node: 372122±14.4%

group=8
base: 436124±0.6%
rate: 490787±3.2% +12.5%
skip: 493097±1.0% +13.0%
node: 497336±1.8% +14.0%

group=16
base: 457730±3.2%
rate: 680452±1.3% +48.8%
skip: 679271±0.9% +48.4%
node: 672365±0.5% +46.9%

netperf/udp_rr on ICL
25%
base: 114413±0.1%
rate: 115111±0.0% +0.6%
skip: 115546±0.2% +1.0%
node: 115214±0.3% +0.7%

50% (all in noise range)
base: 86803±0.5%
rate: 86611±0.0%
skip: 86986±0.5%
node: 87466±0.8%

75%
base: 35959±5.3%
rate: 49801±0.6% +38.5%
skip: 50745±1.5% +41.1%
node: 50457±1.5% +40.3%

100%
base: 61951±6.4%
rate: 70224±0.8% +13.4%
skip: 67321±4.3% +8.7%
node: 70591±2.3% +13.9%

netperf/udp_rr on SPR
25% (all in noise range)
base: 104954±1.3%
rate: 107312±2.8%
skip: 107688±1.7%
node: 106024±1.8%

50% (all in noise range)
base: 55394±4.6%
rate: 54940±7.4%
skip: 57829±1.8%
node: 57144±6.4%

75%
base: 13779±3.1%
rate: 36105±1.1% +162%
skip: 35319±6.3% +156%
node: 36764±3.5% +167%

100%
base: 9703±3.7%
rate: 28011±0.2% +189%
skip: 27087±1.3% +179%
node: 27337±2.2% +182%

netperf/tcp_stream on ICL
25% (all in noise range)
base: 43092±0.1%
rate: 42891±0.5%
skip: 43208±0.4%
node: 43044±0.6%

50% (all in noise range)
base: 19278±14.9%
rate: 22369±7.2%
skip: 19375±16.0%
node: 21312±8.2%

75% (all in noise range)
base: 16822±3.0%
rate: 17086±2.3%
skip: 17571±1.8%
node: 16801±2.4%

100% (all in noise range)
base: 18216±0.6%
rate: 18078±2.9%
skip: 18040±1.0%
node: 18064±1.6%

netperf/tcp_stream on SPR
25% (all in noise range)
base: 34491±0.3%
rate: 34886±0.5%
skip: 34894±0.9%
node: 34692±0.6%

50% (all in noise range)
base: 19278±14.9%
rate: 22369±7.2%
skip: 19375±16.0%
node: 21312±8.2%

75% (all in noise range)
base: 16822±3.0%
rate: 17086±2.3%
skip: 17571±1.8%
node: 16801±2.4%

100% (all in noise range)
base: 18216±0.6%
rate: 18078±2.9%
skip: 18040±0.1%
node: 18064±1.6%

Thanks,
Aaron