[RFC PATCH 0/1] sched/eevdf: Curb wakeup preemption further

From: K Prateek Nayak
Date: Mon Mar 25 2024 - 12:32:40 EST


Problem Statement
=================

Since EEVDF first landed, DeathStarBench [1] has consistently seen a
double-digit drop in performance when all the services are run pinned to
a single CCX or spread across multiple CCXs.

Bisection showed that the regression started at commit 147f3efaa241
("sched/fair: Implement an EEVDF-like scheduling policy") and this was
reported [2]. Narrowing the cause down further within the commit is hard
given the extent of its changes. EEVDF has seen extensive development
since, but the regression persists.


Narrowing down the problem
==========================

Commit 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling
policy") also added a sched_feat to toggle between EEVDF and legacy CFS.
It was noted that switching back to CFS did not completely remedy the
regression. Although some of that could be attributed to the plumbing of
the sched_feat itself, the remaining regression was large enough to
warrant an audit of the path common to CFS and EEVDF added since commit
86bfbb7ce4f6 ("sched/fair: Add lag based placement"). It was discovered
that the clamping of lag introduced by the offending commit was enough
to reproduce the regression. Specifically, the following change on top
of commit 86bfbb7ce4f6 ("sched/fair: Add lag based placement"):

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd12ada69b12..93b5e606e8e5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -715,13 +715,20 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 	return cfs_rq->min_vruntime + avg;
 }
 
+static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se);
+
 /*
  * lag_i = S - s_i = w_i * (V - v_i)
  */
 void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	s64 lag, limit;
+
 	SCHED_WARN_ON(!se->on_rq);
-	se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
+	lag = avg_vruntime(cfs_rq) - se->vruntime;
+
+	limit = calc_delta_fair(max_t(u64, 2 * sysctl_sched_min_granularity, TICK_NSEC), se);
+	se->vlag = clamp(lag, -limit, limit);
 }
 
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
--

This turned attention towards the wakeup path, since until then lag was
only used to determine the vruntime of the entity in place_entity().
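
At that point, place_entity() was the sole consumer of the saved lag. A
minimal userspace sketch of that consumption (simplified structures and
a stub for avg_vruntime(); illustrative, not the kernel implementation):

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel's structures. */
struct sched_entity {
	int64_t vruntime;
	int64_t vlag;	/* lag saved by update_entity_lag() at dequeue */
};

/* Stub for avg_vruntime(cfs_rq): the weighted average vruntime V. */
static int64_t avg_vruntime_stub(void)
{
	return 1000;
}

/*
 * Simplified place_entity(): since lag = V - v, re-placing the entity at
 * v = V - lag restores the service it was owed. An entity dequeued with
 * positive lag lands to the left of the average and is picked sooner.
 */
static void place_entity(struct sched_entity *se)
{
	se->vruntime = avg_vruntime_stub() - se->vlag;
}
```

With the lag clamped as in the diff above, how far to the left (or
right) of the average a re-queued entity can land is bounded.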


sched-scoreboard analysis
=========================

When analyzing the per-task data from the sched-scoreboard [3], it was
noticed that the NGINX server and the load balancer accepting the
incoming requests were being preempted very often. Following are the
relative sched-switch numbers for both algorithms:

                                    86bfbb7ce4f6 (CFS)  147f3efaa241 (EEVDF)   %diff
    Sum of nr_voluntary_switches          1.00                  0.02            -98%
    Sum of nr_involuntary_switches        1.00                  3.20           +220%

Most of the other tasks within the chain of services invoked by the
DeathStarBench workflow saw no significant changes in per-task
statistics except for a drop in number of wakeups proportional to the
drop in the benchmark throughput for the leaf service.

Looking at pick_eevdf(), called during wakeup preemption, the
eligibility check for the current entity was identified as the reason
for this aggressive preemption.


Wakeup preemption problem
=========================

With the curr entity's eligibility check, a wakeup preemption is very
likely when an entity with positive lag joins the runqueue, pushing the
avg_vruntime of the runqueue backwards and making the vruntime of the
current entity ineligible. This leads to aggressive wakeup preemption
that was previously guarded by wakeup_granularity_ns in legacy CFS. The
figure below depicts one such aggressive preemption scenario with EEVDF:

                                          deadline for Nginx
                                                  |
            +-------+                             |
       /--> | Nginx | --------------------------->|
       |    +-------+                             |
       |
       |  -----------|---------------------------------> vruntime timeline
       |             \--> rq->avg_vruntime
       |
       |  wakes service on the same runqueue since system is busy
       |
       |        +---------+|
       \------->| Service ||  (service has +ve lag, pushing avg_vruntime
                +---------+|   backwards)
                           |
    wakeup           +-----|--+
    preempts ------->| Ngin|x | ---------------->|  {deadline for Nginx}
                     +-----|--+
                    (Nginx ineligible)
          -----------------|---------------------------> vruntime timeline
                           \--> rq->avg_vruntime

When the NGINX server is involuntarily switched out, it cannot accept
any incoming requests, leading to a longer turnaround time for the
clients and thus a loss in DeathStarBench throughput.
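
The eligibility flip in the figure can be modeled with a small
userspace sketch (not the kernel's entity_eligible(); all weights are
assumed equal, and the numbers mirror the figure):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * With equal weights, an entity is eligible iff its vruntime has not run
 * ahead of the queue's average vruntime V, i.e. iff lag = V - v_i >= 0.
 */
static bool entity_eligible(int64_t avg_vruntime, int64_t se_vruntime)
{
	return avg_vruntime - se_vruntime >= 0;
}

/*
 * Unweighted average of two entities' vruntimes, standing in for
 * rq->avg_vruntime after the service is enqueued next to curr.
 */
static int64_t avg_of_two(int64_t a, int64_t b)
{
	return (a + b) / 2;
}
```

A running curr sitting at the average (eligible) becomes ineligible the
moment a waking entity with positive lag is enqueued to the left of V,
since the enqueue drags V backwards past curr's vruntime.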

==================================================================
Test : DeathStarBench
Units : Normalized latency
Interpretation: Lower is better
Statistic : Mean
==================================================================
tip 1.00
eevdf 1.14 (+14.61%)


Proposed solution
=================

For the current running task, skip the eligibility check in pick_eevdf()
if it has not yet exhausted the slice promised to it at selection time,
even if the situation has changed since. With the proposed solution, the
performance loss seen with the merge of EEVDF goes away. Following are
the results from testing on a dual socket 3rd Generation EPYC server
(2 x 64C/128T):

==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
    Pinning  scaling   tip    run-to-parity-wakeup (pct imp)
    1CCD        1      1.00   1.16  (%diff: 16%)
    2CCD        2      1.00   1.03  (%diff:  3%)
    4CCD        4      1.00   1.12  (%diff: 12%)
    8CCD        8      1.00   1.05  (%diff:  6%)

With spec_rstack_overflow=off, DeathStarBench performance with the
proposed solution matches that of the v6.5 release, before EEVDF was
merged.
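
The gist of the change can be sketched in userspace C (illustrative
field and feature names, not the actual patch):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the fields pick_eevdf() would consult. */
struct entity {
	uint64_t slice;          /* slice promised at selection time */
	uint64_t slice_consumed; /* runtime used since being picked */
	bool     eligible;       /* result of the usual eligibility test */
};

/*
 * Would a wakeup preempt curr?  With the proposed behaviour, an
 * ineligible curr is still protected until the slice it was promised
 * at selection time is exhausted; only then does the eligibility
 * check decide, as before.
 */
static bool wakeup_preempts_curr(const struct entity *curr,
				 bool run_to_parity_wakeup)
{
	if (run_to_parity_wakeup && curr->slice_consumed < curr->slice)
		return false;	/* run to parity: keep curr on the CPU */

	return !curr->eligible;	/* old behaviour: ineligible curr is preempted */
}
```

In the DeathStarBench scenario above, NGINX (ineligible but mid-slice)
would keep running until its slice expires rather than being switched
out by every waking service.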


Implication
===========

This may lead to a newly waking task waiting longer for its turn on the
CPU. However, testing on a dual socket 3rd Generation EPYC system
(2 x 64C/128T) did not reveal any consistent regressions with the
standard benchmarks:

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) run-to-parity-wakeup[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.08) 0.98 [ 2.29]( 3.19)
2-groups 1.00 [ -0.00]( 0.89) 1.02 [ -2.22]( 1.27)
4-groups 1.00 [ -0.00]( 0.81) 1.00 [ 0.30]( 0.97)
8-groups 1.00 [ -0.00]( 0.78) 0.97 [ 2.62]( 0.80)
16-groups 1.00 [ -0.00]( 1.60) 0.96 [ 3.87]( 2.57)

==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) run-to-parity-wakeup[pct imp](CV)
1 1.00 [ 0.00]( 0.71) 1.01 [ 1.13]( 0.20)
2 1.00 [ 0.00]( 0.25) 1.00 [ -0.42]( 0.32)
4 1.00 [ 0.00]( 0.85) 1.00 [ 0.07]( 0.41)
8 1.00 [ 0.00]( 1.00) 0.99 [ -1.02]( 0.57)
16 1.00 [ 0.00]( 1.25) 0.99 [ -1.13]( 1.90)
32 1.00 [ 0.00]( 0.35) 0.98 [ -1.90]( 2.58)
64 1.00 [ 0.00]( 0.71) 0.98 [ -2.20]( 1.33)
128 1.00 [ 0.00]( 0.46) 0.98 [ -1.94]( 0.42)
256 1.00 [ 0.00]( 0.24) 0.98 [ -2.19]( 0.24)
512 1.00 [ 0.00]( 0.30) 0.99 [ -0.87]( 0.15)
1024 1.00 [ 0.00]( 0.40) 1.00 [ -0.29]( 0.75)

==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) run-to-parity-wakeup[pct imp](CV)
Copy 1.00 [ 0.00]( 9.73) 1.01 [ 0.53]( 6.33)
Scale 1.00 [ 0.00]( 5.57) 1.00 [ -0.17]( 3.89)
Add 1.00 [ 0.00]( 5.43) 1.00 [ 0.48]( 3.63)
Triad 1.00 [ 0.00]( 5.50) 0.98 [ -1.90]( 6.31)

==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) run-to-parity-wakeup[pct imp](CV)
Copy 1.00 [ 0.00]( 3.26) 0.99 [ -0.78]( 2.94)
Scale 1.00 [ 0.00]( 1.26) 0.97 [ -2.86]( 6.48)
Add 1.00 [ 0.00]( 1.47) 0.99 [ -1.34]( 4.20)
Triad 1.00 [ 0.00]( 1.77) 1.01 [ 0.53]( 2.23)

==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) run-to-parity-wakeup[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.22) 1.02 [ 2.26]( 1.08)
2-clients 1.00 [ 0.00]( 0.57) 1.02 [ 2.02]( 1.03)
4-clients 1.00 [ 0.00]( 0.43) 1.02 [ 2.35]( 0.64)
8-clients 1.00 [ 0.00]( 0.27) 1.02 [ 2.32]( 0.72)
16-clients 1.00 [ 0.00]( 0.46) 1.02 [ 1.55]( 0.73)
32-clients 1.00 [ 0.00]( 0.95) 1.00 [ -0.46]( 1.47)
64-clients 1.00 [ 0.00]( 1.79) 0.97 [ -2.72]( 1.34)
128-clients 1.00 [ 0.00]( 0.89) 1.00 [ 0.25]( 0.89)
256-clients 1.00 [ 0.00]( 8.68) 0.95 [ -4.96]( 7.11) *
512-clients 1.00 [ 0.00](35.06) 1.00 [ -0.43](51.55)

==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) run-to-parity-wakeup[pct imp](CV)
1 1.00 [ -0.00](27.28) 0.69 [ 31.25](41.89)
2 1.00 [ -0.00]( 3.85) 1.00 [ -0.00](11.47)
4 1.00 [ -0.00](14.00) 1.21 [-21.05]( 7.20) *
8 1.00 [ -0.00]( 4.68) 1.04 [ -4.17](13.10) *
16 1.00 [ -0.00]( 4.08) 0.98 [ 1.61]( 1.64)
32 1.00 [ -0.00]( 6.68) 0.94 [ 6.12]( 9.27)
64 1.00 [ -0.00]( 1.79) 0.98 [ 1.53]( 2.36)
128 1.00 [ -0.00]( 6.30) 1.03 [ -3.16]( 8.92) *
256 1.00 [ -0.00](43.39) 1.36 [-35.94](12.43) *
512 1.00 [ -0.00]( 2.26) 0.78 [ 22.04]( 1.77)

==================================================================
Test : SPECjbb (Without any scheduler tunable fiddling)
Units : Throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Metric tip run-to-parity-wakeup
Max-jOPS 1.00 1.01 (+1.21%)
Critical-jOPS 1.00 1.01 (+1.05%)

* Points of large regression also show large run-to-run variance on both
tip:sched/core and with the proposed solution.


Future work and alternatives
============================

If wakeup latency is indeed a concern, this solution can be paired with
the "sched/eevdf: Allow shorter slices to wakeup-preempt" patch from
Peter's EEVDF tree [4], allowing latency-sensitive tasks to preempt the
current running task, with the "PREEMPT_SHORT" feat superseding the
"RUN_TO_PARITY_WAKEUP" decision for such tasks.

If not "RUN_TO_PARITY_WAKEUP", a tunable similar to
"wakeup_granularity_ns" might need to be re-introduced to curb such
scenarios with EEVDF.

Tobias seems to be dealing with the other side of the coin, where a
newly woken task is not being scheduled in fast enough on the CPU [5].
It'll be good to discuss how to keep both sides happy, or discover a
middle ground where both are less unhappy :)


References
==========

[1] https://github.com/delimitrou/DeathStarBench/
[2] https://lore.kernel.org/all/18199afa-16e2-8e76-c96b-9507f92befdc@xxxxxxx/
[3] https://github.com/AMDESE/sched-scoreboard/
[4] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=c2594f33fb1e1f4a2a4d962c464cb63a27a8f3d9 (Prone to force-update)
[5] https://lore.kernel.org/lkml/20240228161018.14253-1-huschle@xxxxxxxxxxxxx/

The patch applies cleanly on tip:sched/core at commit b9e6e2866392
("sched/fair: Fix typos in comments").
---
K Prateek Nayak (1):
sched/eevdf: Skip eligibility check for current entity during wakeup
preemption

kernel/sched/fair.c | 24 ++++++++++++++++++++----
kernel/sched/features.h | 1 +
2 files changed, 21 insertions(+), 4 deletions(-)

--
2.34.1