Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path

From: Tim Chen
Date: Thu Feb 17 2022 - 13:00:29 EST


On Wed, 2022-02-16 at 18:00 +0800, Yicong Yang wrote:
> On 2022/2/16 17:19, Song Bao Hua (Barry Song) wrote:
> >
> > tbench running on numa 0&1:
> > 5.17-rc1 rc1 + chenyu rc1+chenyu+cls rc1+chenyu+cls-pingpong rc1+cls
> > Hmean 1 320.01 ( 0.00%) 318.03 * -0.62%* 357.15 * 11.61%* 375.43 * 17.32%* 378.44 * 18.26%*
> > Hmean 2 643.85 ( 0.00%) 637.74 * -0.95%* 714.36 * 10.95%* 745.82 * 15.84%* 752.52 * 16.88%*
> > Hmean 4 1287.36 ( 0.00%) 1285.20 * -0.17%* 1431.35 * 11.18%* 1481.71 * 15.10%* 1505.62 * 16.95%*
> > Hmean 8 2564.60 ( 0.00%) 2551.02 * -0.53%* 2812.74 * 9.68%* 2921.51 * 13.92%* 2955.29 * 15.23%*
> > Hmean 16 5195.69 ( 0.00%) 5163.39 * -0.62%* 5583.28 * 7.46%* 5726.08 * 10.21%* 5814.74 * 11.91%*
> > Hmean 32 9769.16 ( 0.00%) 9815.63 * 0.48%* 10518.35 * 7.67%* 10852.89 * 11.09%* 10872.63 * 11.30%*
> > Hmean 64 15952.50 ( 0.00%) 15780.41 * -1.08%* 10608.36 * -33.50%* 17503.42 * 9.72%* 17281.98 * 8.33%*
> > Hmean 128 13113.77 ( 0.00%) 12000.12 * -8.49%* 13095.50 * -0.14%* 13991.90 * 6.70%* 13895.20 * 5.96%*
> > Hmean 256 10997.59 ( 0.00%) 12229.20 * 11.20%* 11902.60 * 8.23%* 12214.29 * 11.06%* 11244.69 * 2.25%*
> > Hmean 512 14623.60 ( 0.00%) 15863.25 * 8.48%* 14103.38 * -3.56%* 16422.56 * 12.30%* 15526.25 * 6.17%*
> >
>
> Yes, I think it'll also benefit the cluster's condition.
>
> But 128 threads seems like a weird point: Chen's patch on 5.17-rc1 (without this series) causes a degradation there,
> while in Chen's own tbench test it doesn't degrade that much when threads == 2 * cpu number [*]:
>

From the data, it seems that Chen Yu's patch benefits the overloaded condition (as expected) while
cluster scheduling benefits most at the low end (also expected). It is nice that
by combining these two approaches we can get the most benefit.
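
To make the ordering concrete, here is a rough userspace sketch of the cluster-first
idea from the patch subject. It is only an illustration, not the actual kernel
select_idle_sibling()/select_idle_cpu() code: the topology, the cluster size of 4 and
the idle state of each CPU below are made-up assumptions.

/*
 * Standalone sketch (not the actual kernel code) of "scan cluster before
 * LLC": when looking for an idle CPU for a waking task, prefer CPUs
 * sharing the waker's cluster (e.g. a shared L2) before falling back to
 * the rest of the LLC.  Topology and idle state are invented for the demo.
 */
#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS		8
#define CLUSTER_SIZE	4	/* assumption: 4 CPUs share a cluster */

static bool cpu_idle[NR_CPUS] = { false, false, false, true,
				  true,  false, true,  false };

/* Scan the waker's cluster first, then the remaining LLC CPUs. */
static int select_idle_cpu_sketch(int waker_cpu)
{
	int cluster_first = (waker_cpu / CLUSTER_SIZE) * CLUSTER_SIZE;
	int cpu;

	/* 1) CPUs sharing the waker's cluster: cheapest cache-wise. */
	for (cpu = cluster_first; cpu < cluster_first + CLUSTER_SIZE; cpu++)
		if (cpu_idle[cpu])
			return cpu;

	/* 2) Fall back to the rest of the LLC. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_idle[cpu] && (cpu < cluster_first ||
				      cpu >= cluster_first + CLUSTER_SIZE))
			return cpu;

	return waker_cpu;	/* nothing idle: stay where we are */
}

int main(void)
{
	printf("waker on CPU1 -> CPU%d\n", select_idle_cpu_sketch(1));
	printf("waker on CPU5 -> CPU%d\n", select_idle_cpu_sketch(5));
	return 0;
}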

Chen Yu's patch has a hard transition that stops the search for an idle CPU at about 85% utilization.
So we may be hitting that knee, and we may benefit from not stopping the search completely
but instead reducing the number of CPUs searched, as Peter pointed out.
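
Roughly, the difference between the two policies looks like the standalone sketch
below. The 85% knee comes from the description above; the linear taper and the LLC
size are illustrative assumptions, not what either patch actually implements.

/*
 * Sketch of the two policies being compared: a hard cutoff that stops the
 * idle-CPU search entirely above a utilization knee, versus tapering the
 * number of CPUs scanned as utilization grows.  Values are illustrative.
 */
#include <stdio.h>

#define LLC_CPUS	16	/* assumption: 16 CPUs in the LLC */

/* Hard transition: search everything below the knee, nothing above it. */
static int scan_depth_hard(int util_pct)
{
	return util_pct < 85 ? LLC_CPUS : 0;
}

/* Tapered: shrink the search linearly instead of switching it off. */
static int scan_depth_tapered(int util_pct)
{
	int depth = LLC_CPUS * (100 - util_pct) / 100;

	return depth > 0 ? depth : 1;	/* always look at least a little */
}

int main(void)
{
	int util;

	printf("util%%  hard  tapered\n");
	for (util = 50; util <= 100; util += 10)
		printf("%4d %5d %8d\n", util,
		       scan_depth_hard(util), scan_depth_tapered(util));
	return 0;
}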

Tim

> case load baseline(std%) compare%( std%)
> loopback thread-224 1.00 ( 0.17) +2.30 ( 0.10)
>
> [*] https://lore.kernel.org/lkml/20220207034013.599214-1-yu.c.chen@xxxxxxxxx/
>
> > tbench running on numa 0 only:
> > 5.17-rc1 rc1 + chenyu rc1+chenyu+cls rc1+chenyu+cls-pingpong rc1+cls
> > Hmean 1 324.73 ( 0.00%) 330.96 * 1.92%* 358.97 * 10.54%* 376.05 * 15.80%* 378.01 * 16.41%*
> > Hmean 2 645.36 ( 0.00%) 643.13 * -0.35%* 710.78 * 10.14%* 744.34 * 15.34%* 754.63 * 16.93%*
> > Hmean 4 1302.09 ( 0.00%) 1297.11 * -0.38%* 1425.22 * 9.46%* 1484.92 * 14.04%* 1507.54 * 15.78%*
> > Hmean 8 2612.03 ( 0.00%) 2623.60 * 0.44%* 2843.15 * 8.85%* 2937.81 * 12.47%* 2982.57 * 14.19%*
> > Hmean 16 5307.12 ( 0.00%) 5304.14 * -0.06%* 5610.46 * 5.72%* 5763.24 * 8.59%* 5886.66 * 10.92%*
> > Hmean 32 9354.22 ( 0.00%) 9738.21 * 4.11%* 9360.21 * 0.06%* 9699.05 * 3.69%* 9908.13 * 5.92%*
> > Hmean 64 7240.35 ( 0.00%) 7210.75 * -0.41%* 6992.70 * -3.42%* 7321.52 * 1.12%* 7278.78 * 0.53%*
> > Hmean 128 6186.40 ( 0.00%) 6314.89 * 2.08%* 6166.44 * -0.32%* 6279.85 * 1.51%* 6187.85 ( 0.02%)
> > Hmean 256 9231.40 ( 0.00%) 9469.26 * 2.58%* 9134.42 * -1.05%* 9322.88 * 0.99%* 9448.61 * 2.35%*
> > Hmean 512 8907.13 ( 0.00%) 9130.46 * 2.51%* 9023.87 * 1.31%* 9276.19 * 4.14%* 9397.22 * 5.50%*
> >
> > > like rc1+cls, in some
> > > cases (256, 512 threads on numa0&1), it is even much better.
> > >
> > > Thanks
> > > Barry