Re: [tip: sched/core] sched/eevdf: Curb wakeup-preemption

From: K Prateek Nayak
Date: Mon Aug 21 2023 - 06:39:22 EST


Hello Peter,

Sorry for being late to the party, but a couple of benchmarks are (very!)
unhappy with EEVDF, even with this optimization. I'll leave the results
of testing on a dual-socket 3rd Generation EPYC system (2 x 64C/128T)
running in NPS1 mode below.

tl;dr

- Hackbench with medium load, tbench when overloaded, and DeathStarBench
are not fans of EEVDF so far :(

- schbench, when the system is overloaded, sees a great benefit in
99th percentile latency, but that is expected since the deadline is
fixed to (vruntime + base_slice), with base_slice_ns equal to the
legacy min_granularity_ns in all cases. Some unixbench cases see a
good benefit too.

- Others seem perf neutral.

On 8/17/2023 8:40 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: 63304558ba5dcaaff9e052ee43cfdcc7f9c29e85
> Gitweb: https://git.kernel.org/tip/63304558ba5dcaaff9e052ee43cfdcc7f9c29e85
> Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> AuthorDate: Wed, 16 Aug 2023 15:40:59 +02:00
> Committer: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> CommitterDate: Thu, 17 Aug 2023 17:07:07 +02:00
>
> sched/eevdf: Curb wakeup-preemption
>
> Mike and others noticed that EEVDF does like to over-schedule quite a
> bit -- which does hurt performance of a number of benchmarks /
> workloads.
>
> In particular, what seems to cause over-scheduling is that when lag is
> of the same order (or larger) than the request / slice then placement
> will not only cause the task to be placed left of current, but also
> with a smaller deadline than current, which causes immediate
> preemption.
>
> [ notably, lag bounds are relative to HZ ]
>
> Mike suggested we stick to picking 'current' for as long as it's
> eligible to run, giving it uninterrupted runtime until it reaches
> parity with the pack.
>
> Augment Mike's suggestion by only allowing it to exhaust its initial
> request.
>
> One random data point:
>
> echo NO_RUN_TO_PARITY > /debug/sched/features
> perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
>
> 3,723,554 context-switches ( +- 0.56% )
> 9.5136 +- 0.0394 seconds time elapsed ( +- 0.41% )
>
> echo RUN_TO_PARITY > /debug/sched/features
> perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
>
> 2,556,535 context-switches ( +- 0.51% )
> 9.2427 +- 0.0302 seconds time elapsed ( +- 0.33% )
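
The over-scheduling described above can be illustrated with some toy
arithmetic. Below is a standalone userspace sketch (made-up numbers; the
placement and virtual-deadline math are paraphrased from my reading of
place_entity()/update_deadline(), not the actual kernel code) showing how
a wakee whose lag is of the same order as its slice ends up with both a
smaller vruntime and a smaller virtual deadline than current, hence the
immediate preemption:

/*
 * toy_eevdf_preempt.c - standalone sketch, not kernel code.
 *
 * Mimics the wakeup placement the commit message describes: the wakee is
 * placed at (zero-lag point - lag) and gets a virtual deadline of
 * (vruntime + vslice). With lag >= vslice, the wakee's deadline lands
 * before current's, so the wakeup preempts immediately.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t avg_vruntime = 1000000;  /* zero-lag point of the runqueue (made up) */
	int64_t vslice       = 3000000;  /* request/slice in virtual time (made up)  */

	/* current was just picked and given a fresh virtual deadline */
	int64_t curr_vruntime = avg_vruntime;
	int64_t curr_deadline = curr_vruntime + vslice;

	/* wakee carries lag of the same order as its slice */
	int64_t lag            = 4000000;
	int64_t wakee_vruntime = avg_vruntime - lag;       /* left of current */
	int64_t wakee_deadline = wakee_vruntime + vslice;   /* and earlier too */

	printf("curr : vruntime=%8lld vdeadline=%8lld\n",
	       (long long)curr_vruntime, (long long)curr_deadline);
	printf("wakee: vruntime=%8lld vdeadline=%8lld -> %s\n",
	       (long long)wakee_vruntime, (long long)wakee_deadline,
	       wakee_deadline < curr_deadline ?
	       "wakee preempts immediately" : "curr keeps running");

	return 0;
}

With RUN_TO_PARITY, pick_eevdf() keeps returning current (while it stays
eligible) until it is handed a new slice, i.e. until it has exhausted the
request it had at pick time, so the immediate preemption above gets
deferred. The same (vruntime + scaled base_slice) deadline arithmetic is
also what the schbench note in the tl;dr refers to.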

o System Details

- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode

o Kernels

base: tip:sched/core at commit 752182b24bf4 ("Merge tag
'v6.5-rc2' into sched/core, to pick up fixes")

eevdf: tip:sched/core at commit c1fc6484e1fb ("sched/rt:
sysctl_sched_rr_timeslice show default timeslice after
reset")

eevdf_curb: tip:sched/core at commit 63304558ba5d ("sched/eevdf:
Curb wakeup-preemption")

o Benchmark Results

* - Regression
^ - Improvement

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: base[pct imp](CV) eevdf[pct imp](CV) eevdf-curb[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.51) 1.02 [ -1.69]( 1.89) 1.03 [ -2.54]( 2.42)
2-groups 1.00 [ -0.00]( 1.63) 1.05 [ -4.68]( 2.04) 1.04 [ -3.75]( 1.25) *
4-groups 1.00 [ -0.00]( 1.80) 1.07 [ -7.47]( 2.38) 1.07 [ -6.81]( 1.68) *
8-groups 1.00 [ -0.00]( 1.43) 1.06 [ -6.22]( 1.52) 1.06 [ -6.43]( 1.32) *
16-groups 1.00 [ -0.00]( 1.04) 1.01 [ -1.27]( 3.44) 1.02 [ -1.55]( 2.58)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) eevdf[pct imp](CV) eevdf-curb[pct imp](CV)
1 1.00 [ 0.00]( 0.49) 1.01 [ 0.97]( 0.18) 1.01 [ 0.52]( 0.06)
2 1.00 [ 0.00]( 1.94) 1.02 [ 2.36]( 0.63) 1.02 [ 1.62]( 0.63)
4 1.00 [ 0.00]( 1.07) 1.00 [ -0.19]( 0.86) 1.01 [ 0.76]( 1.19)
8 1.00 [ 0.00]( 1.41) 1.02 [ 1.69]( 0.22) 1.01 [ 1.48]( 0.73)
16 1.00 [ 0.00]( 1.31) 1.04 [ 3.72]( 1.99) 1.05 [ 4.67]( 1.36)
32 1.00 [ 0.00]( 5.31) 1.04 [ 3.53]( 4.29) 1.05 [ 4.52]( 2.21)
64 1.00 [ 0.00]( 3.08) 1.12 [ 12.12]( 1.71) 1.10 [ 10.19]( 3.06)
128 1.00 [ 0.00]( 1.54) 1.01 [ 1.02]( 0.65) 0.98 [ -2.23]( 0.62)
256 1.00 [ 0.00]( 1.09) 0.95 [ -5.42]( 0.19) 0.92 [ -7.86]( 0.50) *
512 1.00 [ 0.00]( 0.20) 0.91 [ -9.03]( 0.20) 0.90 [-10.25]( 0.29) *
1024 1.00 [ 0.00]( 0.22) 0.88 [-12.47]( 0.29) 0.87 [-13.46]( 0.49) *


==================================================================
Test : stream-10
Units : Normalized Bandwidth
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: base[pct imp](CV) eevdf[pct imp](CV) eevdf-curb[pct imp](CV)
Copy 1.00 [ 0.00]( 3.95) 1.00 [ 0.03]( 4.32) 1.02 [ 2.26]( 2.73)
Scale 1.00 [ 0.00]( 8.33) 1.05 [ 5.17]( 5.21) 1.05 [ 4.80]( 5.48)
Add 1.00 [ 0.00]( 8.15) 1.05 [ 4.50]( 6.25) 1.04 [ 4.44]( 5.53)
Triad 1.00 [ 0.00]( 3.11) 0.93 [ -6.55](10.74) 0.97 [ -2.86]( 7.14)


==================================================================
Test : stream-100
Units : Normalized Bandwidth
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: base[pct imp](CV) eevdf[pct imp](CV) eevdf-curb[pct imp](CV)
Copy 1.00 [ 0.00]( 0.95) 1.00 [ 0.30]( 0.70) 1.00 [ 0.30]( 1.08)
Scale 1.00 [ 0.00]( 0.73) 0.97 [ -2.93]( 6.55) 1.00 [ 0.15]( 0.82)
Add 1.00 [ 0.00]( 1.69) 0.98 [ -2.19]( 6.53) 1.01 [ 0.88]( 1.08)
Triad 1.00 [ 0.00]( 7.49) 1.02 [ 2.02]( 6.66) 1.05 [ 4.88]( 4.56)


==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) eevdf[pct imp](CV) eevdf-curb[pct imp](CV)
1-clients 1.00 [ 0.00]( 1.07) 1.00 [ 0.42]( 0.46) 1.01 [ 1.02]( 0.70)
2-clients 1.00 [ 0.00]( 0.78) 1.00 [ -0.26]( 0.38) 1.00 [ 0.40]( 0.92)
4-clients 1.00 [ 0.00]( 0.96) 1.01 [ 0.77]( 0.72) 1.01 [ 1.07]( 0.83)
8-clients 1.00 [ 0.00]( 0.53) 1.00 [ -0.30]( 0.98) 1.00 [ 0.15]( 0.82)
16-clients 1.00 [ 0.00]( 1.05) 1.00 [ 0.22]( 0.70) 1.01 [ 0.54]( 1.26)
32-clients 1.00 [ 0.00]( 1.29) 1.00 [ 0.12]( 0.74) 1.00 [ 0.16]( 1.24)
64-clients 1.00 [ 0.00]( 2.80) 1.00 [ -0.27]( 2.24) 1.00 [ 0.32]( 3.06)
128-clients 1.00 [ 0.00]( 1.57) 1.00 [ -0.42]( 1.72) 0.99 [ -0.63]( 1.64)
256-clients 1.00 [ 0.00]( 3.85) 1.02 [ 2.40]( 4.44) 1.00 [ 0.45]( 3.71)
512-clients 1.00 [ 0.00](45.83) 1.00 [ 0.12](52.42) 0.97 [ -2.75](57.69)


==================================================================
Test : schbench (old)
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: base[pct imp](CV) eevdf[pct imp](CV) eevdf-curb[pct imp](CV)
1 1.00 [ -0.00]( 2.28) 1.00 [ -0.00]( 2.28) 1.00 [ -0.00]( 2.28)
2 1.00 [ -0.00](11.27) 1.27 [-27.27]( 6.42) 1.14 [-13.64](11.02) *
4 1.00 [ -0.00]( 1.95) 1.00 [ -0.00]( 3.77) 0.93 [ 6.67]( 4.22)
8 1.00 [ -0.00]( 4.17) 1.03 [ -2.70](13.83) 0.95 [ 5.41]( 1.63)
16 1.00 [ -0.00]( 4.17) 0.98 [ 2.08]( 4.37) 1.04 [ -4.17]( 3.53)
32 1.00 [ -0.00]( 1.89) 1.00 [ -0.00]( 8.69) 0.96 [ 3.70]( 5.14)
64 1.00 [ -0.00]( 3.66) 1.03 [ -3.31]( 2.30) 1.06 [ -5.96]( 2.56)
128 1.00 [ -0.00]( 5.79) 0.85 [ 14.77](12.12) 0.97 [ 3.15]( 6.76) ^
256 1.00 [ -0.00]( 8.50) 0.15 [ 84.84](26.04) 0.17 [ 83.43]( 8.04) ^
512 1.00 [ -0.00]( 2.01) 0.28 [ 72.09]( 5.62) 0.28 [ 72.35]( 3.48) ^


==================================================================
Test : Unixbench
Units : Various, Throughput
Interpretation: Higher is better
Statistic : AMean, Hmean (Specified)
==================================================================

tip eevdf eevdf-curb
Hmean unixbench-dhry2reg-1 41333812.04 ( 0.00%) 41248390.97 ( -0.21%) 41576959.80 ( 0.59%)
Hmean unixbench-dhry2reg-512 6244993319.97 ( 0.00%) 6239969914.15 ( -0.08%) 6223263669.12 ( -0.35%)
Amean unixbench-syscall-1 2932426.17 ( 0.00%) 2968518.27 * -1.23%* 2923093.63 * 0.32%*
Amean unixbench-syscall-512 7670057.70 ( 0.00%) 7790656.20 * -1.57%* 8300980.77 * 8.23%* ^
Hmean unixbench-pipe-1 2571551.92 ( 0.00%) 2535689.01 * -1.39%* 2472718.52 * -3.84%*
Hmean unixbench-pipe-512 366469338.93 ( 0.00%) 361385055.25 * -1.39%* 363215893.62 * -0.89%*
Hmean unixbench-spawn-1 4263.51 ( 0.00%) 4506.26 * 5.69%* 4520.53 * 6.03%* ^
Hmean unixbench-spawn-512 67782.44 ( 0.00%) 69380.09 * 2.36%* 69709.04 * 2.84%*
Hmean unixbench-execl-1 3829.47 ( 0.00%) 3824.57 ( -0.13%) 3835.20 ( 0.15%)
Hmean unixbench-execl-512 11929.77 ( 0.00%) 12288.64 ( 3.01%) 13096.25 * 9.78%* ^


==================================================================
Test : ycsb-mongodb
Units : Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================

base 303129.00 (var: 0.68%)
eevdf 309589.33 (var: 1.41%) (+2.13%)
eevdf-curb 303940.00 (var: 1.09%) (+0.27%)


==================================================================
Test : DeathStarBench
Units : %diff of Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================

base eevdf eevdf_curb
1CCD 0% -15.15% -16.55%
2CCD 0% -13.80% -16.23%
4CCD 0% -7.50% -10.11%
8CCD 0% -3.42% -3.68%

--

I'll go back to profiling hackbench, tbench, and DeathStarBench, and will
keep the thread updated with any findings. Let me know if you have any
pointers for debugging. I plan on using Chenyu's schedstats extension
unless IBS or idle-info shows some obvious problems - thank you Chenyu
for sharing the schedstats patch :)
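
For reference, one quick way to eyeball the per-task picture in the
meantime is diffing /proc/<pid>/schedstat (cpu time, runqueue wait time,
number of timeslices) across a benchmark run. A minimal sketch, assuming
schedstats / CONFIG_SCHED_INFO are enabled; the field meanings are the
standard ones from Documentation/scheduler/sched-stats.rst:

/*
 * schedstat_delta.c - minimal sketch for eyeballing per-task scheduler
 * behaviour; assumes /proc/<pid>/schedstat is available.
 *
 * /proc/<pid>/schedstat exposes three values:
 *   cpu time (ns), runqueue wait time (ns), number of timeslices run.
 *
 * Usage: ./schedstat_delta <pid> <seconds>
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int read_schedstat(const char *path, unsigned long long v[3])
{
	FILE *f = fopen(path, "r");
	int n;

	if (!f)
		return -1;
	n = fscanf(f, "%llu %llu %llu", &v[0], &v[1], &v[2]);
	fclose(f);
	return n == 3 ? 0 : -1;
}

int main(int argc, char **argv)
{
	char path[64];
	unsigned long long a[3], b[3];

	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <seconds>\n", argv[0]);
		return 1;
	}

	snprintf(path, sizeof(path), "/proc/%s/schedstat", argv[1]);

	if (read_schedstat(path, a))
		return 1;
	sleep(atoi(argv[2]));
	if (read_schedstat(path, b))
		return 1;

	printf("exec delta   : %llu ns\n", b[0] - a[0]);
	printf("wait delta   : %llu ns\n", b[1] - a[1]);
	printf("slices delta : %llu\n",    b[2] - a[2]);
	return 0;
}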

>
> Suggested-by: Mike Galbraith <umgwanakikbuti@xxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Link: https://lkml.kernel.org/r/20230816134059.GC982867@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> ---
> kernel/sched/fair.c | 12 ++++++++++++
> kernel/sched/features.h | 1 +
> 2 files changed, 13 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f496cef..0b7445c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -873,6 +873,13 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> curr = NULL;
>
> + /*
> + * Once selected, run a task until it either becomes non-eligible or
> + * until it gets a new slice. See the HACK in set_next_entity().
> + */
> + if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
> + return curr;
> +
> while (node) {
> struct sched_entity *se = __node_2_se(node);
>
> @@ -5167,6 +5174,11 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> update_stats_wait_end_fair(cfs_rq, se);
> __dequeue_entity(cfs_rq, se);
> update_load_avg(cfs_rq, se, UPDATE_TG);
> + /*
> + * HACK, stash a copy of deadline at the point of pick in vlag,
> + * which isn't used until dequeue.
> + */
> + se->vlag = se->deadline;
> }
>
> update_stats_curr_start(cfs_rq, se);
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 61bcbf5..f770168 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -6,6 +6,7 @@
> */
> SCHED_FEAT(PLACE_LAG, true)
> SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
> +SCHED_FEAT(RUN_TO_PARITY, true)
>
> /*
> * Prefer to schedule the task we woke last (assuming it failed


--
Thanks and Regards,
Prateek