RE: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path

From: Song Bao Hua (Barry Song)
Date: Wed Feb 16 2022 - 04:19:15 EST




> -----Original Message-----
> From: Barry Song [mailto:21cnbao@xxxxxxxxx]
> Sent: Wednesday, February 16, 2022 10:13 PM
> To: Gautham R. Shenoy <gautham.shenoy@xxxxxxx>
> Cc: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>; yangyicong
> <yangyicong@xxxxxxxxxx>; Peter Zijlstra <peterz@xxxxxxxxxxxxx>; Ingo Molnar
> <mingo@xxxxxxxxxx>; Juri Lelli <juri.lelli@xxxxxxxxxx>; Vincent Guittot
> <vincent.guittot@xxxxxxxxxx>; Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>; LKML
> <linux-kernel@xxxxxxxxxxxxxxx>; LAK <linux-arm-kernel@xxxxxxxxxxxxxxxxxxx>;
> Dietmar Eggemann <dietmar.eggemann@xxxxxxx>; Steven Rostedt
> <rostedt@xxxxxxxxxxx>; Ben Segall <bsegall@xxxxxxxxxx>; Daniel Bristot de
> Oliveira <bristot@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>;
> Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>; ego@xxxxxxxxxxxxxxxxxx;
> Linuxarm <linuxarm@xxxxxxxxxx>; Song Bao Hua (Barry Song)
> <song.bao.hua@xxxxxxxxxxxxx>; Guodong Xu <guodong.xu@xxxxxxxxxx>
> Subject: Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in
> wake-up path
>
> On Tue, Feb 8, 2022 at 6:42 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >
> > On Tue, Feb 8, 2022 at 4:14 AM Gautham R. Shenoy <gautham.shenoy@xxxxxxx> wrote:
> > >
> > >
> > > On Fri, Feb 04, 2022 at 11:28:25PM +1300, Barry Song wrote:
> > >
> > > > > We already figured out that there are no idle CPUs in this cluster. So
> > > > > don't we gain performance by picking an idle CPU/core in the neighbouring
> > > > > cluster? If there is no idle CPU/core in the neighbouring cluster, then
> > > > > it does make sense to fall back on the current cluster.
> > > >
> > > > What you suggested is exactly the approach we tried at the very
> > > > beginning, during debugging. But we didn't gain performance according
> > > > to the benchmark; we were actually losing. That is why we added this
> > > > check to stop ping-ponging:
> > > >         /* Don't ping-pong tasks in and out cluster frequently */
> > > >         if (cpus_share_resources(target, prev_cpu))
> > > >                 return target;
> > > >
> > > > If we delete this, we see a big loss in tbench when the system load is
> > > > medium and above.
> > >
> > > Thanks for clarifying this Barry. Indeed, if the workload is sensitive
> > > to data ping-ponging across L2 clusters, this heuristic makes sense. I
> > > was thinking of workloads that require lower tail latency, in which
> > > case exploring the larger LLC would have made more sense, assuming
> > > that the larger LLC has an idle core/CPU.
> > >
> > > In the absence of any hints from the workload, like something that
> > > Peter had previously suggested
> > > (https://lore.kernel.org/lkml/YVwnsrZWrnWHaoqN@hirez.programming.kicks-ass.net/),
> > > optimizing for cache-access seems to be the right thing to do.
> >
> > Thanks, Gautham.
> >
> > Yep. Peter mentioned some hints like SCHED_BATCH and SCHED_IDLE.
> > To me, the case we are discussing seems to be more complicated than
> > applying some scheduling policy on separate tasks by SCHED_BATCH
> > or IDLE.
> >
> > For example, say we have a process with 20 threads. thread0-9 might
> > care about cache-coherence latency and want to avoid ping-ponging, while
> > thread10-19 might want tail latency to be as small as possible. So we
> > need some way to tell the kernel, "hey, bro, please try to keep
> > thread0-9 still, as ping-ponging will hurt them, while trying your best
> > to find an idle cpu in a wider range for thread10-19". But it seems
> > SCHED_XXX as a scheduler policy hint can't tell the kernel how to
> > organize tasks into groups, and is also incapable of telling the kernel
> > that different groups have different needs.
> >
> > So it seems we want some special cgroups to organize tasks, so that we
> > can apply some special hints to each group. For example, putting
> > thread0-9 in one cgroup and thread10-19 in another, then:
> > 1. apply "COMMUNICATION-SENSITIVE" on the 1st group
> > 2. apply "TAIL-LATENCY-SENSITIVE" on the 2nd one.
> > I am not quite sure how to do this, or whether it can find its way into
> > the mainline.
> >
> > On the other hand, for this particular patch, the most controversial
> > part is those two lines to avoid ping-ponging, and I am seeing that
> > dropping them hurts workloads like tbench only when system load is
> > high. So I wonder if the approach[1] from Chen Yu and Tim can somehow
> > resolve the problem in another way, so that we can avoid the
> > controversial part, since their patch can also shrink the scanning
> > range while LLC load is high.
> >
> > [1] https://lore.kernel.org/lkml/20220207034013.599214-1-yu.c.chen@xxxxxxxxx/
>
> Yicong's testing shows the patch from Chen Yu and Tim can somehow resolve
> the problem and make sure there is no performance regression for tbench
> while load is high after we remove the code to avoid ping-pong:
>
> 5.17-rc1: vanilla
> rc1 + chenyu: vanilla + chenyu's LLC overload patch
> rc1+chenyu+cls: vanilla + chenyu's patch + my this patchset
> rc1+chenyu+cls-pingpong: vanilla + chenyu's patch + my this patchset -
> the code avoiding ping-pong
> rc1+cls: vanilla + my this patchset
>
> tbench running on numa 0 &1:
> 5.17-rc1 rc1 + chenyu
> rc1+chenyu+cls rc1+chenyu+cls-pingpong rc1+cls
> Hmean 1 320.01 ( 0.00%) 318.03 * -0.62%*
> 357.15 * 11.61%* 375.43 * 17.32%* 378.44 * 18.26%*
> Hmean 2 643.85 ( 0.00%) 637.74 * -0.95%*
> 714.36 * 10.95%* 745.82 * 15.84%* 752.52 * 16.88%*
> Hmean 4 1287.36 ( 0.00%) 1285.20 * -0.17%*
> 1431.35 * 11.18%* 1481.71 * 15.10%* 1505.62 * 16.95%*
> Hmean 8 2564.60 ( 0.00%) 2551.02 * -0.53%*
> 2812.74 * 9.68%* 2921.51 * 13.92%* 2955.29 * 15.23%*
> Hmean 16 5195.69 ( 0.00%) 5163.39 * -0.62%*
> 5583.28 * 7.46%* 5726.08 * 10.21%* 5814.74 * 11.91%*
> Hmean 32 9769.16 ( 0.00%) 9815.63 * 0.48%*
> 10518.35 * 7.67%* 10852.89 * 11.09%* 10872.63 * 11.30%*
> Hmean 64 15952.50 ( 0.00%) 15780.41 * -1.08%*
> 10608.36 * -33.50%* 17503.42 * 9.72%* 17281.98 * 8.33%*
> Hmean 128 13113.77 ( 0.00%) 12000.12 * -8.49%*
> 13095.50 * -0.14%* 13991.90 * 6.70%* 13895.20 * 5.96%*
> Hmean 256 10997.59 ( 0.00%) 12229.20 * 11.20%*
> 11902.60 * 8.23%* 12214.29 * 11.06%* 11244.69 * 2.25%*
> Hmean 512 14623.60 ( 0.00%) 15863.25 * 8.48%*
> 14103.38 * -3.56%* 16422.56 * 12.30%* 15526.25 * 6.17%*
>
> tbench running on numa 0 only:
>
> 5.17-rc1 rc1 + chenyu
> rc1+chenyu+cls rc1+chenyu+cls-pingpong rc1+cls
> Hmean 1 324.73 ( 0.00%) 330.96 * 1.92%*
> 358.97 * 10.54%* 376.05 * 15.80%* 378.01 * 16.41%*
> Hmean 2 645.36 ( 0.00%) 643.13 * -0.35%*
> 710.78 * 10.14%* 744.34 * 15.34%* 754.63 * 16.93%*
> Hmean 4 1302.09 ( 0.00%) 1297.11 * -0.38%*
> 1425.22 * 9.46%* 1484.92 * 14.04%* 1507.54 * 15.78%*
> Hmean 8 2612.03 ( 0.00%) 2623.60 * 0.44%*
> 2843.15 * 8.85%* 2937.81 * 12.47%* 2982.57 * 14.19%*
> Hmean 16 5307.12 ( 0.00%) 5304.14 * -0.06%*
> 5610.46 * 5.72%* 5763.24 * 8.59%* 5886.66 * 10.92%*
> Hmean 32 9354.22 ( 0.00%) 9738.21 * 4.11%*
> 9360.21 * 0.06%* 9699.05 * 3.69%* 9908.13 * 5.92%*
> Hmean 64 7240.35 ( 0.00%) 7210.75 * -0.41%*
> 6992.70 * -3.42%* 7321.52 * 1.12%* 7278.78 * 0.53%*
> Hmean 128 6186.40 ( 0.00%) 6314.89 * 2.08%*
> 6166.44 * -0.32%* 6279.85 * 1.51%* 6187.85 ( 0.02%)
> Hmean 256 9231.40 ( 0.00%) 9469.26 * 2.58%*
> 9134.42 * -1.05%* 9322.88 * 0.99%* 9448.61 * 2.35%*
> Hmean 512 8907.13 ( 0.00%) 9130.46 * 2.51%*
> 9023.87 * 1.31%* 9276.19 * 4.14%* 9397.22 * 5.50%*
>

Sorry, it seems the format is broken. Let me re-post the data.

5.17-rc1:                vanilla
rc1 + chenyu:            vanilla + chenyu's LLC overload patch
rc1+chenyu+cls:          vanilla + chenyu's patch + this patchset
rc1+chenyu+cls-pingpong: vanilla + chenyu's patch + this patchset, minus
                         the code avoiding ping-pong
rc1+cls:                 vanilla + this patchset
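
To make the difference between the cls and cls-pingpong kernels concrete,
here is a rough userspace model of the scan order. It is only a sketch, not
the patch itself: toy_select_cpu(), toy_share_resources(), TOY_NR_CPUS and
TOY_CLUSTER_SIZE are names made up for illustration, the 8-CPU LLC with
4 CPUs per cluster is an assumed layout, and the exact position of the
ping-pong check relative to the cluster scan is a simplification.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS      8  /* assumed: one LLC with 8 CPUs */
#define TOY_CLUSTER_SIZE 4  /* assumed: 4 CPUs share a cluster */

/* Toy stand-in for the cpus_share_resources() check quoted above. */
static bool toy_share_resources(int a, int b)
{
	return a / TOY_CLUSTER_SIZE == b / TOY_CLUSTER_SIZE;
}

/*
 * Toy model of the wake-up scan order; not the kernel implementation.
 * idle[i] is true when CPU i is idle.
 */
static int toy_select_cpu(int target, int prev_cpu,
			  const bool idle[TOY_NR_CPUS], bool avoid_pingpong)
{
	int first = (target / TOY_CLUSTER_SIZE) * TOY_CLUSTER_SIZE;
	int cpu;

	assert(target >= 0 && target < TOY_NR_CPUS);
	assert(prev_cpu >= 0 && prev_cpu < TOY_NR_CPUS);

	/* 1. Scan the target's own cluster first. */
	for (cpu = first; cpu < first + TOY_CLUSTER_SIZE; cpu++)
		if (idle[cpu])
			return cpu;

	/* 2. Present in the cls kernels, dropped in cls-pingpong. */
	if (avoid_pingpong && toy_share_resources(target, prev_cpu))
		return target;

	/* 3. Otherwise scan the rest of the LLC. */
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (!toy_share_resources(cpu, target) && idle[cpu])
			return cpu;

	/* 4. Nothing idle anywhere: fall back to the target. */
	return target;
}
```

With only CPU 5 idle, target 1 and prev_cpu 2, this model stays on CPU 1
when avoid_pingpong is set and moves to CPU 5 when it is not.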

tbench running on numa 0&1:
                      5.17-rc1         rc1 + chenyu       rc1+chenyu+cls  rc1+chenyu+cls-pingpong              rc1+cls
Hmean 1      320.01 (   0.00%)    318.03 *  -0.62%*    357.15 *  11.61%*        375.43 *  17.32%*    378.44 *  18.26%*
Hmean 2      643.85 (   0.00%)    637.74 *  -0.95%*    714.36 *  10.95%*        745.82 *  15.84%*    752.52 *  16.88%*
Hmean 4     1287.36 (   0.00%)   1285.20 *  -0.17%*   1431.35 *  11.18%*       1481.71 *  15.10%*   1505.62 *  16.95%*
Hmean 8     2564.60 (   0.00%)   2551.02 *  -0.53%*   2812.74 *   9.68%*       2921.51 *  13.92%*   2955.29 *  15.23%*
Hmean 16    5195.69 (   0.00%)   5163.39 *  -0.62%*   5583.28 *   7.46%*       5726.08 *  10.21%*   5814.74 *  11.91%*
Hmean 32    9769.16 (   0.00%)   9815.63 *   0.48%*  10518.35 *   7.67%*      10852.89 *  11.09%*  10872.63 *  11.30%*
Hmean 64   15952.50 (   0.00%)  15780.41 *  -1.08%*  10608.36 * -33.50%*      17503.42 *   9.72%*  17281.98 *   8.33%*
Hmean 128  13113.77 (   0.00%)  12000.12 *  -8.49%*  13095.50 *  -0.14%*      13991.90 *   6.70%*  13895.20 *   5.96%*
Hmean 256  10997.59 (   0.00%)  12229.20 *  11.20%*  11902.60 *   8.23%*      12214.29 *  11.06%*  11244.69 *   2.25%*
Hmean 512  14623.60 (   0.00%)  15863.25 *   8.48%*  14103.38 *  -3.56%*      16422.56 *  12.30%*  15526.25 *   6.17%*

tbench running on numa 0 only:
                      5.17-rc1         rc1 + chenyu       rc1+chenyu+cls  rc1+chenyu+cls-pingpong              rc1+cls
Hmean 1      324.73 (   0.00%)    330.96 *   1.92%*    358.97 *  10.54%*        376.05 *  15.80%*    378.01 *  16.41%*
Hmean 2      645.36 (   0.00%)    643.13 *  -0.35%*    710.78 *  10.14%*        744.34 *  15.34%*    754.63 *  16.93%*
Hmean 4     1302.09 (   0.00%)   1297.11 *  -0.38%*   1425.22 *   9.46%*       1484.92 *  14.04%*   1507.54 *  15.78%*
Hmean 8     2612.03 (   0.00%)   2623.60 *   0.44%*   2843.15 *   8.85%*       2937.81 *  12.47%*   2982.57 *  14.19%*
Hmean 16    5307.12 (   0.00%)   5304.14 *  -0.06%*   5610.46 *   5.72%*       5763.24 *   8.59%*   5886.66 *  10.92%*
Hmean 32    9354.22 (   0.00%)   9738.21 *   4.11%*   9360.21 *   0.06%*       9699.05 *   3.69%*   9908.13 *   5.92%*
Hmean 64    7240.35 (   0.00%)   7210.75 *  -0.41%*   6992.70 *  -3.42%*       7321.52 *   1.12%*   7278.78 *   0.53%*
Hmean 128   6186.40 (   0.00%)   6314.89 *   2.08%*   6166.44 *  -0.32%*       6279.85 *   1.51%*   6187.85 (   0.02%)
Hmean 256   9231.40 (   0.00%)   9469.26 *   2.58%*   9134.42 *  -1.05%*       9322.88 *   0.99%*   9448.61 *   2.35%*
Hmean 512   8907.13 (   0.00%)   9130.46 *   2.51%*   9023.87 *   1.31%*       9276.19 *   4.14%*   9397.22 *   5.50%*
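
The percentages in both tables are relative changes against the 5.17-rc1
column of the same row. A minimal sketch of that arithmetic, in case anyone
wants to re-derive a figure (pct_change() is a hypothetical helper written
for this mail, not part of any benchmark tool):

```c
#include <assert.h>

/* Relative change of `value` against `baseline`, in percent. */
static double pct_change(double baseline, double value)
{
	assert(baseline > 0.0);
	return (value - baseline) / baseline * 100.0;
}
```

For example, pct_change(15952.50, 10608.36) comes out at roughly -33.50,
which matches the 64-thread rc1+chenyu+cls entry on numa 0&1.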

> like rc1+cls, in some
> cases(256, 512 threads on numa0&1), it is even much better.
>
> Thanks
> Barry