Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

From: Mel Gorman
Date: Fri Dec 11 2020 - 17:53:07 EST


On Fri, Dec 11, 2020 at 11:19:05PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2020 at 08:43:37PM +0000, Mel Gorman wrote:
> > One bug is in __select_idle_core() though. It's scanning the SMT mask,
> > not select_idle_mask so it can return an idle candidate that is not in
> > p->cpus_ptr.
>
> D'0h.. luckily the benchmarks don't hit that :-)
>

Yep, neither do mine for the most part which is why I ran it as-is but
eventually someone would complain that sched_setscheduler was being
ignored.

Trick is whether a failed check should continue searching for an idle core
or terminate early and just pick an allowed idle CPU. I tend to favour
the latter but it's hard to predict what sort of reasonable workloads
would be affected.

> > There are some other potential caveats.
> >
> > This is a single pass so when test_idle_cores() is true, __select_idle_core
> > is used to to check all the siblings even if the core is not idle. That
> > could have been cut short if __select_idle_core checked *idle_cpu ==
> > 1 and terminated the SMT scan if an idle candidate had already been found.
>
> So I did that on purpose, so as to track the last/most-recent idle cpu,
> with the thinking that that cpu has the higher chance of still being
> idle vs one we checked earlier/longer-ago.
>
> I suppose we benchmark both and see which is liked best.
>

I originally did something like that on purpose too but Vincent called
it out so it is worth mentioning now to avoid surprises. That said, I'm
at the point where anything SIS-related simply needs excessive exposure
because no matter what it does, someone is unhappy with it.

> > Second downside is related. If test_idle_cpus() was true but no idle
> > CPU is found then __select_idle_core has been called enough to scan
> > the entire domain. In this corner case, the new code does *more* work
> > because the old code would have failed select_idle_core() quickly and
> > then select_idle_cpu() would be throttled by SIS_PROP. I think this will
> > only be noticable in the heavily overloaded case but if the corner case
> > hits enough then the new code will be slower than the old code for the
> > over-saturated case (i.e. hackbench with lots of groups).
>
> Right, due to scanning siblings, even if the first inspected thread is
> not idle, we scan more.
>

Yep.

> > The third potential downside is that the SMT sibling is not guaranteed to
> > be checked due to SIS_PROP throttling but in the old code, that would have
> > been checked by select_idle_smt(). That might result in premature stacking
> > of runnable tasks on the same CPU. Similarly, as __select_idle_core may
> > find multiple idle candidates, it will not pick the targets SMT sibling
> > if it is idle like select_idle_smt would have.
> >
> > That said, I am skeptical that select_idle_smt() matters all that often.
>
> This, I didn't really believe in it either.
>

Good because I think any benefit from select_idle_smt is so marginal
that it should be ignored if the full scan is simpler overall.

> The benchmarks I started are mostly noise, with a few wins for
> TCP_STREAM and UDP_RR around the 50% mark. Although I should run
> longer variants to make sure.

So far I have one benchmark from one machine. It's tbench again because
it's a reasonable communicating workload that is trivial to vary the
level of utilisation.

80-cpu CascadeLake, 2 sockets, HT enabled
tbench4
5.10.0-rc6 5.10.0-rc6 5.10.0-rc6
baseline-v1r1 singlescan-v1r1 singlescan-v1r3
Hmean 1 503.90 ( 0.00%) 505.19 * 0.26%* 499.20 * -0.93%*
Hmean 2 980.80 ( 0.00%) 975.15 * -0.58%* 983.79 * 0.31%*
Hmean 4 1912.37 ( 0.00%) 1883.25 * -1.52%* 1923.76 * 0.60%*
Hmean 8 3741.47 ( 0.00%) 3568.66 * -4.62%* 3690.60 * -1.36%*
Hmean 16 6552.90 ( 0.00%) 6549.97 * -0.04%* 6478.37 * -1.14%*
Hmean 32 10217.34 ( 0.00%) 10266.66 * 0.48%* 10291.60 * 0.73%*
Hmean 64 13604.93 ( 0.00%) 11240.88 * -17.38%* 12045.87 * -11.46%*
Hmean 128 21194.11 ( 0.00%) 21316.08 * 0.58%* 21868.39 * 3.18%*
Hmean 256 21163.19 ( 0.00%) 20989.14 * -0.82%* 20831.20 * -1.57%*
Hmean 320 20906.29 ( 0.00%) 21024.11 * 0.56%* 20756.88 * -0.71%*
Stddev 1 3.14 ( 0.00%) 1.17 ( 62.93%) 1.05 ( 66.61%)
Stddev 2 4.44 ( 0.00%) 3.72 ( 16.35%) 2.20 ( 50.56%)
Stddev 4 9.09 ( 0.00%) 18.67 (-105.32%) 3.66 ( 59.71%)
Stddev 8 12.87 ( 0.00%) 18.96 ( -47.31%) 11.90 ( 7.58%)
Stddev 16 8.34 ( 0.00%) 8.77 ( -5.22%) 36.25 (-334.84%)
Stddev 32 27.05 ( 0.00%) 20.90 ( 22.74%) 28.57 ( -5.61%)
Stddev 64 720.66 ( 0.00%) 20.12 ( 97.21%) 32.10 ( 95.55%)
Stddev 128 17.49 ( 0.00%) 52.33 (-199.22%) 137.68 (-687.25%)
Stddev 256 57.17 ( 0.00%) 18.87 ( 67.00%) 38.98 ( 31.81%)
Stddev 320 29.87 ( 0.00%) 46.49 ( -55.64%) 31.48 ( -5.39%)

5.10.0-rc6 5.10.0-rc6 5.10.0-rc6
baseline-v1r1singlescan-v1r1singlescan-v1r3
Duration User 9771.18 9435.64 9353.29
Duration System 34224.28 33829.01 33802.34
Duration Elapsed 2218.87 2218.87 2218.69

baseline is roughly what's in tip for 5.11-rc1 with patches 1-2 from my
series like you had.

singlescan-v1r1 is your patch

singlescan-v1r3 is your patch with my "untested" patch on top that
enforces p->cpus_ptr and short-cuts corner cases.

Few points of interest

1. 64 clients regresses. With 64 clients, this is roughly the point
where test_idle_cores() returns false positives and we hit the worst
corner cases. Your patch regresses 17%, mine is only a marginal
improvement and still a regression slower.

2. Variations are all over the place. Your patch is great at low
utilisation and stabilises overall. After that, your milage varies a lot

3. The system CPu usage summarised over the entire workload drops quite
a bit. Whether it's your patch or minor changes on top, there is less
work being done within the kernel overall

A wider battery of tests is still running and a second set is queued
that adds the debugging schedstats but they take ages to run.

I'm currently on "holidays" so response time will be variable :P

--
Mel Gorman
SUSE Labs