Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

From: Peter Zijlstra
Date: Mon May 01 2023 - 04:29:54 EST


On Sat, Apr 29, 2023 at 09:34:06PM +0200, Mike Galbraith wrote:
> On Sat, 2023-04-29 at 07:16 +0800, Chen Yu wrote:
> > [Problem Statement]
> > For a workload that is doing frequent context switches, the throughput
> > scales well until the number of instances reaches a peak point. After
> > that peak point, the throughput drops significantly if the number of
> > instances continues to increase.
> >
> > The will-it-scale context_switch1 test case exposes the issue. The
> > test platform has 2 x 56C/112T sockets, 224 CPUs in total. will-it-scale
> > launches 1, 8, 16 ... instances in turn. Each instance is composed
> > of 2 tasks, and each pair of tasks does ping-pong scheduling via
> > pipe_read() and pipe_write(). No task is bound to any CPU. It is found
> > that, once the number of instances exceeds 56, the throughput
> > drops accordingly:
> >
> >           ^
> > throughput|
> >           |                 X
> >           |               X   X X
> >           |             X         X X
> >           |           X               X
> >           |         X                   X
> >           |       X
> >           |     X
> >           |   X
> >           | X
> >           |
> >           +-----------------.------------------->
> >                             56
> >                                  number of instances
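
Roughly, each instance above is just a pair of tasks bouncing a byte
between two pipes, along the lines of the minimal sketch below (not the
actual will-it-scale context_switch1 source); every iteration costs one
wakeup plus one context switch in each direction:

#include <unistd.h>

int main(void)
{
	int ping[2], pong[2];
	char c = 0;

	pipe(ping);
	pipe(pong);

	if (fork() == 0) {
		for (;;) {		/* wakee: wait for the poke, reply */
			read(ping[0], &c, 1);
			write(pong[1], &c, 1);
		}
	}
	for (;;) {			/* waker: poke, wait for the reply */
		write(ping[1], &c, 1);
		read(pong[0], &c, 1);
	}
}
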
>
> Should these buddy pairs not start interfering with one another at 112
> instances instead of 56? NR_CPUS/2 buddy-pair instances is the point at
> which trying to turn waker/wakee overlap into throughput should tend
> toward being a loser, the man-in-the-middle wakeup delay pain more
> than offsetting the overlap recovery gain, rendering a sync wakeup
> thereafter an ever more likely win.
>
> Anyway..
>
> What I see in my box, and I bet a virtual nickel it's a player in your
> box as well, is WA_WEIGHT making a mess of things by stacking tasks,
> sometimes very badly. Below, I start NR_CPUS tbench buddy pairs in my
> crusty ole i4790 desktop box with WA_WEIGHT turned off, then turn it on
> remotely so as not to have the noisy GUI muck up my demo.
>
> ...
> 8 3155749 3606.79 MB/sec warmup 38 sec latency 3.852 ms
> 8 3238485 3608.75 MB/sec warmup 39 sec latency 3.839 ms
> 8 3321578 3608.59 MB/sec warmup 40 sec latency 3.882 ms
> 8 3404746 3608.09 MB/sec warmup 41 sec latency 2.273 ms
> 8 3487885 3607.58 MB/sec warmup 42 sec latency 3.869 ms
> 8 3571034 3607.12 MB/sec warmup 43 sec latency 3.855 ms
> 8 3654067 3607.48 MB/sec warmup 44 sec latency 3.857 ms
> 8 3736973 3608.83 MB/sec warmup 45 sec latency 4.008 ms
> 8 3820160 3608.33 MB/sec warmup 46 sec latency 3.849 ms
> 8 3902963 3607.60 MB/sec warmup 47 sec latency 14.241 ms
> 8 3986117 3607.17 MB/sec warmup 48 sec latency 20.290 ms
> 8 4069256 3606.70 MB/sec warmup 49 sec latency 28.284 ms
> 8 4151986 3608.35 MB/sec warmup 50 sec latency 17.216 ms
> 8 4235070 3608.06 MB/sec warmup 51 sec latency 23.221 ms
> 8 4318221 3607.81 MB/sec warmup 52 sec latency 28.285 ms
> 8 4401456 3607.29 MB/sec warmup 53 sec latency 20.835 ms
> 8 4484606 3607.06 MB/sec warmup 54 sec latency 28.943 ms
> 8 4567609 3607.32 MB/sec warmup 55 sec latency 28.254 ms
>
> Where I turned it on is hard to miss.
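
For anyone wanting to reproduce that flip: WA_WEIGHT and its siblings are
sched_feat() flags, declared roughly as below in kernel/sched/features.h
(as of ~v6.3), and they can be toggled at run time through the scheduler's
debugfs features file, e.g. "echo NO_WA_WEIGHT >
/sys/kernel/debug/sched/features" on recent kernels:

/* kernel/sched/features.h -- the wake-affine knobs, all default on */
SCHED_FEAT(WA_IDLE, true)
SCHED_FEAT(WA_WEIGHT, true)
SCHED_FEAT(WA_BIAS, true)
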
>
> Short-duration thread pool workers can be stacked all the way to the
> ceiling by WA_WEIGHT during burst wakeups, with wake_wide() unable to
> intervene because the lack of cross coupling between waker and wakees
> defeats its heuristic. A (now long) while ago I caught that happening
> with firefox event threads: it launched 32 of 'em in my 8-rq box (hmm),
> and them being essentially the scheduler equivalent of neutrinos
> (nearly massless), we stuffed 'em all into one rq... and got away with
> it because those particular threads don't seem to do much of anything.
> However, were they to go active, the latency hit that we set up could
> have stung mightily. That scenario being highly generic leads me to
> suspect that somewhere out there in the big wide world, folks are
> eating that burst serialization.
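
For reference, the heuristic being defeated here is wake_wide(): it only
rejects an affine wakeup when both waker and wakee have accumulated enough
wakee flips relative to the LLC size, so one waker fanning out to a bunch
of loosely coupled workers leaves the wakees' flip counts near zero and the
check never fires. Paraphrased from kernel/sched/fair.c (around v6.3):

static int wake_wide(struct task_struct *p)
{
	unsigned int master = current->wakee_flips;
	unsigned int slave = p->wakee_flips;
	int factor = __this_cpu_read(sd_llc_size);	/* CPUs sharing the LLC */

	if (master < slave)
		swap(master, slave);
	/* burst wakees with few flips never get past this check */
	if (slave < factor || master < slave * factor)
		return 0;	/* stay on the wake-affine path */
	return 1;
}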

I'm thinking WA_BIAS makes this worse...
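
The comparison in question lives in wake_affine_weight(): with WA_BIAS the
previous CPU's load is additionally scaled by half the domain's
imbalance_pct margin, so the waker's CPU wins the comparison that much more
often, which is exactly the stacking direction. A condensed sketch,
paraphrased from kernel/sched/fair.c around v6.3 with the sync special
cases trimmed:

static int wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
			      int this_cpu, int prev_cpu, int sync)
{
	unsigned long task_load = task_h_load(p);
	s64 this_eff_load, prev_eff_load;

	this_eff_load = cpu_load(cpu_rq(this_cpu));
	if (sync)
		this_eff_load -= task_h_load(current);
	this_eff_load += task_load;
	if (sched_feat(WA_BIAS))
		this_eff_load *= 100;
	this_eff_load *= capacity_of(prev_cpu);

	prev_eff_load = cpu_load(cpu_rq(prev_cpu));
	prev_eff_load -= task_load;
	if (sched_feat(WA_BIAS))
		/* bias the comparison toward the waker's CPU */
		prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
	prev_eff_load *= capacity_of(this_cpu);

	/* pull to the waker's CPU only if it looks less loaded */
	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}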