Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running

From: Vincent Guittot
Date: Mon Oct 04 2021 - 12:37:18 EST


On Mon, 4 Oct 2021 at 10:05, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Sep 27, 2021 at 04:17:25PM +0200, Mike Galbraith wrote:
> > On Mon, 2021-09-27 at 12:17 +0100, Mel Gorman wrote:
> > > On Thu, Sep 23, 2021 at 02:41:06PM +0200, Vincent Guittot wrote:
> > > > On Thu, 23 Sept 2021 at 11:22, Mike Galbraith <efault@xxxxxx> wrote:
> > > > >
> > > > > On Thu, 2021-09-23 at 10:40 +0200, Vincent Guittot wrote:
> > > > > >
> > > > > > a 100us value should even be enough to fix Mel's problem without
> > > > > > impacting common wakeup preemption cases.
> > > > >
> > > > > It'd be nice if it turn out to be something that simple, but color me
> > > > > skeptical. I've tried various preemption throttling schemes, and while
> > > >
> > > > Let's see what the results will show. I tend to agree that this will
> > > > not be enough to cover all use cases and I don't see any other way to
> > > > cover all cases than getting some inputs from the threads about their
> > > > latency fairness which bring us back to some kind of latency niceness
> > > > value
> > > >
> > >
> > > Unfortunately, I didn't get a complete set of results but enough to work
> > > with. The missing tests have been requeued. The figures below are based
> > > on a single-socket Skylake machine with 8 CPUs as it had the most set of
> > > results and is the basic case.
> >
> > There's something missing, namely how does whatever load you measure
> > perform when facing dissimilar competition. Instead of only scaling
> > loads running solo from underutilized to heavily over-committed, give
> > them competition. eg something switch heavy, say tbench, TCP_RR et al
> > (latency bound load) pairs=CPUS vs something hefty like make -j CPUS or
> > such.
> >
>
> Ok, that's an interesting test. I've been out intermittently and will be
> for the next few weeks but I managed to automate something that can test
> this. The test runs a kernel compile with -jNR_CPUS and TCP_RR running
> NR_CPUS pairs of clients/servers in the background with the default
> openSUSE Leap kernel config (CONFIG_PREEMPT_NONE) with the two patches
> and no tricks done with task priorities. 5 kernel compilations are run
> and TCP_RR is shutdown when the compilation finishes.
>
> This can be reproduced with the mmtests config
> config-multi-kernbench__netperf-tcp-rr-multipair using xfs as the
> filesystem for the kernel compilation.
>
> sched-scalewakegran-v2r5: my patch
> sched-moveforward-v1r1: Vincent's patch

If I'm not wrong, you refer to the 1st version which scales with the
number of cpu by sched-moveforward-v1r1. We don't want to scale with
the number of cpu because this can create some quite large non
preemptable duration. We want to ensure a fix small runtime like the
last version with 100us

>
>
> multi subtest kernbench
> 5.15.0-rc1 5.15.0-rc1 5.15.0-rc1
> vanillasched-scalewakegran-v2r5 sched-moveforward-v1r1
> Amean user-80 1518.87 ( 0.00%) 1520.34 ( -0.10%) 1518.93 ( -0.00%)
> Amean syst-80 248.57 ( 0.00%) 247.74 ( 0.33%) 232.93 * 6.29%*
> Amean elsp-80 48.76 ( 0.00%) 48.51 ( 0.52%) 48.70 ( 0.14%)
> Stddev user-80 10.15 ( 0.00%) 9.17 ( 9.70%) 10.25 ( -0.93%)
> Stddev syst-80 2.83 ( 0.00%) 3.02 ( -6.65%) 3.65 ( -28.83%)
> Stddev elsp-80 3.54 ( 0.00%) 3.28 ( 7.28%) 2.40 ( 32.13%)
> CoeffVar user-80 0.67 ( 0.00%) 0.60 ( 9.79%) 0.67 ( -0.93%)
> CoeffVar syst-80 1.14 ( 0.00%) 1.22 ( -7.01%) 1.57 ( -37.48%)
> CoeffVar elsp-80 7.26 ( 0.00%) 6.76 ( 6.79%) 4.93 ( 32.04%)
>
> With either patch, time to finish compilations is not affected with
> differences in elapsed time being well within the noise
>
> Meanwhile, netperf tcp-rr running with NR_CPUS pairs showed the
> following
>
> multi subtest netperf-tcp-rr
> 5.15.0-rc1 5.15.0-rc1 5.15.0-rc1
> vanilla sched-scalewakegran-v2r5 sched-moveforward-v1r1
> Min 1 32388.28 ( 0.00%) 32208.66 ( -0.55%) 31824.54 ( -1.74%)
> Hmean 1 39112.22 ( 0.00%) 39364.10 ( 0.64%) 39552.30 * 1.13%*
> Stddev 1 3471.61 ( 0.00%) 3357.28 ( 3.29%) 3713.97 ( -6.98%)
> CoeffVar 1 8.81 ( 0.00%) 8.47 ( 3.87%) 9.31 ( -5.67%)
> Max 1 53019.93 ( 0.00%) 51263.38 ( -3.31%) 51263.04 ( -3.31%)
>
> This shows a slightly different picture with Vincent's patch having a small
> impact on netperf tcp-rr. It's noisy and may be subject to test-to-test
> variances but it's a mild concern. A greater concern is that across
> all machines, dbench was heavily affected by Vincent's patch even for
> relatively low thread counts which is surprising.
>
> For the same Cascadelake machine both resulst are from, dbench reports
>
> 5.15.0-rc1 5.15.0-rc1 5.15.0-rc1
> vanillasched-scalewakegran-v2r5 sched-moveforward-v1r1
> Amean 1 15.99 ( 0.00%) 16.20 * -1.27%* 16.18 * -1.16%*
> Amean 2 18.43 ( 0.00%) 18.34 * 0.50%* 22.72 * -23.28%*
> Amean 4 22.32 ( 0.00%) 22.06 * 1.14%* 45.86 *-105.52%*
> Amean 8 30.58 ( 0.00%) 30.22 * 1.18%* 99.04 *-223.88%*
> Amean 16 41.79 ( 0.00%) 41.68 * 0.25%* 161.09 *-285.52%*
> Amean 32 63.45 ( 0.00%) 63.16 * 0.45%* 248.13 *-291.09%*
> Amean 64 127.81 ( 0.00%) 128.50 * -0.54%* 402.93 *-215.25%*
> Amean 128 330.42 ( 0.00%) 336.06 * -1.71%* 531.35 * -60.81%*
>
> That is an excessive impairment. While it varied across machines, there
> was some impact on all of them. For a 1-socket skylake machine to rule
> out NUMA artifacts, I get

As mentioned above, the 1st version which uses
sysctl_sched_min_granularity is not the best solution because it
scales the duration with the number of cpus on the system. The move
forward duration should be fixed and small like the last proposal with
100us. Would be intetesting to see results for this last version
instead

>
> dbench4 Loadfile Execution Time
> 5.15.0-rc1 5.15.0-rc1 5.15.0-rc1
> vanillasched-scalewakegran-v2r5 sched-moveforward-v1r1
> Amean 1 29.51 ( 0.00%) 29.45 * 0.21%* 29.58 * -0.22%*
> Amean 2 37.46 ( 0.00%) 37.16 * 0.82%* 39.81 * -6.26%*
> Amean 4 51.31 ( 0.00%) 51.34 ( -0.04%) 57.14 * -11.35%*
> Amean 8 81.77 ( 0.00%) 81.65 ( 0.15%) 88.68 * -8.44%*
> Amean 64 406.94 ( 0.00%) 408.08 * -0.28%* 433.64 * -6.56%*
> Stddev 1 1.43 ( 0.00%) 1.44 ( -0.79%) 1.54 ( -7.45%)
>
> Not as dramatic but indicates that we likely do not want to cut off
> wakeup_preempt too early a problem.
>
> The test was not profiling times to switch tasks as the overhead
> distorts resules.
>
> --
> Mel Gorman
> SUSE Labs