Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

From: Mel Gorman
Date: Thu Dec 10 2020 - 06:37:35 EST


On Thu, Dec 10, 2020 at 04:23:47PM +0800, Li, Aubrey wrote:
> > I ran this patch with tbench on top of of the schedstat patches that
> > track SIS efficiency. The tracking adds overhead so it's not a perfect
> > performance comparison but the expectation would be that the patch reduces
> > the number of runqueues that are scanned
>
> Thanks for the measurement! I don't play with tbench so may need a while
> to digest the data.
>

They key point is that it appears the idle mask was mostly equivalent to
the full domain mask, at least for this test.

> >
> > tbench4
> > 5.10.0-rc6 5.10.0-rc6
> > schedstat-v1r1 idlemask-v7r1
> > Hmean 1 504.76 ( 0.00%) 500.14 * -0.91%*
> > Hmean 2 1001.22 ( 0.00%) 970.37 * -3.08%*
> > Hmean 4 1930.56 ( 0.00%) 1880.96 * -2.57%*
> > Hmean 8 3688.05 ( 0.00%) 3537.72 * -4.08%*
> > Hmean 16 6352.71 ( 0.00%) 6439.53 * 1.37%*
> > Hmean 32 10066.37 ( 0.00%) 10124.65 * 0.58%*
> > Hmean 64 12846.32 ( 0.00%) 11627.27 * -9.49%*
> > Hmean 128 22278.41 ( 0.00%) 22304.33 * 0.12%*
> > Hmean 256 21455.52 ( 0.00%) 20900.13 * -2.59%*
> > Hmean 320 21802.38 ( 0.00%) 21928.81 * 0.58%*
> >
> > Not very optimistic result. The schedstats indicate;
>
> How many client threads was the following schedstats collected?
>

That's the overall summary for all client counts. While proc-schedstat
was measured every few seconds over all client counts, presenting that
in text format is not easy to parse. However, looking at the graphs over
time, it did not appear that scan rates were consistently lower for any
client count for tbench.

> >
> > 5.10.0-rc6 5.10.0-rc6
> > schedstat-v1r1 idlemask-v7r1
> > Ops TTWU Count 5599714302.00 5589495123.00
> > Ops TTWU Local 2687713250.00 2563662550.00
> > Ops SIS Search 5596677950.00 5586381168.00
> > Ops SIS Domain Search 3268344934.00 3229088045.00
> > Ops SIS Scanned 15909069113.00 16568899405.00
> > Ops SIS Domain Scanned 13580736097.00 14211606282.00
> > Ops SIS Failures 2944874939.00 2843113421.00
> > Ops SIS Core Search 262853975.00 311781774.00
> > Ops SIS Core Hit 185189656.00 216097102.00
> > Ops SIS Core Miss 77664319.00 95684672.00
> > Ops SIS Recent Used Hit 124265515.00 146021086.00
> > Ops SIS Recent Used Miss 338142547.00 403547579.00
> > Ops SIS Recent Attempts 462408062.00 549568665.00
> > Ops SIS Search Efficiency 35.18 33.72
> > Ops SIS Domain Search Eff 24.07 22.72
> > Ops SIS Fast Success Rate 41.60 42.20
> > Ops SIS Success Rate 47.38 49.11
> > Ops SIS Recent Success Rate 26.87 26.57
> >
> > The field I would expect to decrease is SIS Domain Scanned -- the number
> > of runqueues that were examined but it's actually worse and graphing over
> > time shows it's worse for the client thread counts. select_idle_cpu()
> > is definitely being called because "Domain Search" is 10 times higher than
> > "Core Search" and there "Core Miss" is non-zero.
>
> Why SIS Domain Scanned can be decreased?
>

Because if idle CPUs are being targetted and its a subset of the entire
domain then it follows that fewer runqueues should be examined when
scanning the domain.

> I thought SIS Scanned was supposed to be decreased but it seems not on your side.
>

It *should* have been decreased but it's indicating that more runqueues
were scanned with the patch. It should be noted that the scan count is
naturally variable because the depth of each individual search is variable.

> I printed some trace log on my side by uperf workload, and it looks properly.
> To make the log easy to read, I started a 4 VCPU VM to run 2-second uperf 8 threads.
>
> stage 1: system idle, update_idle_cpumask is called from idle thread, set cpumask to 0-3
> ========================================================================================
> <idle>-0 [002] d..1 137.408681: update_idle_cpumask: set_idle-1, cpumask: 2
> <idle>-0 [000] d..1 137.408713: update_idle_cpumask: set_idle-1, cpumask: 0,2
> <idle>-0 [003] d..1 137.408924: update_idle_cpumask: set_idle-1, cpumask: 0,2-3
> <idle>-0 [001] d..1 137.409035: update_idle_cpumask: set_idle-1, cpumask: 0-3
>

What's the size of the LLC domain on this machine? If it's 4 then this
is indicating that there is little difference between scanning the full
domain and targetting idle CPUs via the idle cpumask.

> stage 3: uperf running, select_idle_cpu scan all the CPUs in the scheduler domain at the beginning
> ===================================================================================================
> uperf-560 [000] d..3 138.418494: select_task_rq_fair: scanning: 0-3
> uperf-560 [000] d..3 138.418506: select_task_rq_fair: scanning: 0-3
> uperf-560 [000] d..3 138.418514: select_task_rq_fair: scanning: 0-3
> uperf-560 [000] dN.3 138.418534: select_task_rq_fair: scanning: 0-3
> uperf-560 [000] dN.3 138.418543: select_task_rq_fair: scanning: 0-3
> uperf-560 [000] dN.3 138.418551: select_task_rq_fair: scanning: 0-3
> uperf-561 [003] d..3 138.418577: select_task_rq_fair: scanning: 0-3
> uperf-561 [003] d..3 138.418600: select_task_rq_fair: scanning: 0-3
> uperf-561 [003] d..3 138.418617: select_task_rq_fair: scanning: 0-3
> uperf-561 [003] d..3 138.418640: select_task_rq_fair: scanning: 0-3
> uperf-561 [003] d..3 138.418652: select_task_rq_fair: scanning: 0-3
> uperf-561 [003] d..3 138.418662: select_task_rq_fair: scanning: 0-3
> uperf-561 [003] d..3 138.418672: select_task_rq_fair: scanning: 0-3
> uperf-560 [000] d..5 138.418676: select_task_rq_fair: scanning: 0-3
> uperf-561 [003] d..3 138.418693: select_task_rq_fair: scanning: 0-3
> kworker/u8:3-110 [002] d..4 138.418746: select_task_rq_fair: scanning: 0-3
>
> stage 4: scheduler tick comes, update idle cpumask to EMPTY
> ============================================================
> uperf-572 [002] d.h. 139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
> uperf-574 [000] d.H2 139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
> uperf-565 [003] d.H6 139.420569: update_idle_cpumask: set_idle-0, cpumask: 1
> tmux: server-528 [001] d.h2 139.420572: update_idle_cpumask: set_idle-0, cpumask:
>

It's the timing of the clear that may be problematic. For the
configurations I use, the tick is every 4 milliseconds which is a very
long time in the scheduler for communicating workloads. A simple
round-trip of perf pipe where tasks will be entering/exiting rapidly is
just 0.004 milliseconds on my desktop.

> >
> > I suspect the issue is that the mask is only marked busy from the tick
> > context which is a very wide window. If select_idle_cpu() picks an idle
> > CPU from the mask, it's still marked as idle in the mask.
> >
> That should be fine because we still check available_idle_cpu() and sched_idle_cpu for the selected
> CPU. And even if that CPU is marked as idle in the mask, that mask should not be worse(bigger) than
> the default sched_domain_span(sd).
>

I agree that it should never be worse/bigger but if the idle cpu mask is
often the same bits as the sched_domain_span then it's not going to be
any better either.

--
Mel Gorman
SUSE Labs