Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

From: Daniel Jordan
Date: Wed Aug 23 2023 - 20:54:33 EST


Hi Peter,

On Wed, May 31, 2023 at 01:58:39PM +0200, Peter Zijlstra wrote:
>
> Hi!
>
> Latest version of the EEVDF [1] patches.
>
> The only real change since last time is the fix for tick-preemption [2], and a
> simple safe-guard for the mixed slice heuristic.

We're seeing regressions from EEVDF with SPEC CPU, a database workload,
and a Java workload. We tried SPEC CPU on five systems; here are
numbers from one of them (a high-core-count, two-socket x86 machine).

SPECrate2017 oversubscribed by 2x (two copies of the test per CPU)

Base: v6.3-based kernel
EEVDF: Base + patches from May 31 [0]

Performance comparison: >0 if EEVDF wins

Integer

-0.5% 500.perlbench_r
-6.6% 502.gcc_r
-8.7% 505.mcf_r
-9.2% 520.omnetpp_r
-6.6% 523.xalancbmk_r
-0.7% 525.x264_r
-2.1% 531.deepsjeng_r
-0.4% 541.leela_r
-0.3% 548.exchange2_r
-2.6% 557.xz_r

-3.8% Est(*) SPECrate2017_int_base

Floating Point

-0.6% 503.bwaves_r
-1.3% 507.cactuBSSN_r
-0.8% 508.namd_r
-17.8% 510.parest_r
0.3% 511.povray_r
-1.0% 519.lbm_r
-7.7% 521.wrf_r
-2.4% 526.blender_r
-6.1% 527.cam4_r
-2.0% 538.imagick_r
0.1% 544.nab_r
-0.7% 549.fotonik3d_r
-11.3% 554.roms_r

-4.1% Est(*) SPECrate2017_fp_base

(*) SPEC CPU Fair Use rules require that results from tests with
non-production components be marked as estimates.

The other machines show similarly consistent regressions. We've also tried a
v6.5-rc4-based kernel with the latest EEVDF patches from tip/sched/core,
including the recent fixlet "sched/eevdf: Curb wakeup-preemption". I can post
the rest of the numbers, but I'm trying to keep this on the shorter side for
now.

Running the database workload on a two-socket x86 server, we see
regressions of up to 6% when the number of users exceeds the number of
CPUs.

With the Java workload on another two-socket x86 server, we see a 10%
regression.

We're investigating the other benchmarks, but here's what I've found so far
with SPEC CPU. Some schedstats show that EEVDF is much more tick-preemption
happy (schedstat patches below). These stats were taken over a 1-minute window
near the middle of a ~26-minute run of one benchmark (502.gcc_r).

Base: v6.5-rc4-based kernel
EEVDF: Base + the latest EEVDF patches from tip/sched/core

schedstat                 Base       EEVDF

sched                1,243,911   3,947,251

tick_check_preempts 12,899,049
tick_preempts        1,022,998

check_deadline                  15,878,463
update_deadline                  3,895,530
preempt_deadline                 3,751,580

In both kernels, tick preemption is primarily what drives schedule()s, but
preemptions happen over three times more often with EEVDF. In the base kernel,
tick preemption happens once a task has run through its ideal timeslice, a
fraction of sched_latency, so two tasks sharing a CPU each get 12ms on a
server with enough CPUs (sched_latency being 24ms). With EEVDF, a task's base
slice determines when it gets tick-preempted, and that's 3ms by default. SPEC
CPU doesn't seem to like the increased scheduling of EEVDF in a cpu-bound load
like this: when I set the base_slice_ns sysctl to 12000000, the regression
disappears.
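
As a rough sanity check on that ratio (a back-of-the-envelope sketch, not
something derived from the schedstats themselves): a cpu-bound task is
rescheduled roughly once per timeslice, so going from a 12ms slice to a 3ms
one means about four times as many preemptions, in the same ballpark as the
~3.7x jump from tick_preempts to preempt_deadline above.

/*
 * Back-of-the-envelope sketch, not kernel code: a cpu-bound task is
 * tick-preempted roughly once per timeslice, so the expected preemption
 * rate on a busy CPU is about 1/slice.
 */
#include <stdio.h>

int main(void)
{
	double cfs_slice_ms   = 12.0;	/* sched_latency 24ms / 2 runnable tasks */
	double eevdf_slice_ms =  3.0;	/* default EEVDF base slice */

	printf("CFS:   ~%3.0f preemptions/sec per busy CPU\n", 1000.0 / cfs_slice_ms);
	printf("EEVDF: ~%3.0f preemptions/sec per busy CPU\n", 1000.0 / eevdf_slice_ms);
	printf("ratio: ~%.1fx more under EEVDF\n", cfs_slice_ms / eevdf_slice_ms);

	return 0;
}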

I'm still thinking about how to fix it. Pre-EEVDF, tick preemption was more
flexible in that a task's timeslice could change depending on how much
competition it had on the same CPU. With EEVDF the timeslice is fixed no
matter what else is running, and growing or shrinking it depending on
nr_running wouldn't honor whatever deadline was already set for the task.
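
To make that conflict concrete, here's a toy user-space sketch (not kernel
code and not a proposed fix; the 12ms-to-3ms rescale is an arbitrary stand-in
for an nr_running-based rule): the deadline is computed from whatever slice is
in force at the time, so resizing the slice afterwards leaves the task running
against a deadline that no longer matches its request, and recomputing the
deadline instead abandons the one that was already granted.

/*
 * Toy sketch: weights are ignored so "virtual" time equals wall time.
 */
#include <stdio.h>

struct toy_se {
	double vruntime;	/* (virtual) runtime consumed so far, ms */
	double deadline;	/* end of the currently granted request, ms */
	double slice;		/* requested slice, ms */
};

static void set_deadline(struct toy_se *se)
{
	/* mirrors update_deadline(): vd_i = ve_i + r_i / w_i */
	se->deadline = se->vruntime + se->slice;
}

int main(void)
{
	struct toy_se se = { .vruntime = 0.0, .slice = 12.0 };

	set_deadline(&se);	/* the task was granted a 12ms request */

	/* 3ms later more tasks arrive; shrink the slice... */
	se.vruntime += 3.0;
	se.slice = 3.0;

	/*
	 * ...but the deadline already handed out still encodes the 12ms
	 * request: keep it and the new slice is ignored until the next
	 * refresh, or recompute it and the deadline that was set is gone.
	 */
	printf("vruntime %.0fms, new slice %.0fms, granted deadline %.0fms\n",
	       se.vruntime, se.slice, se.deadline);
	printf("keeping the old deadline: %.0fms more before preemption\n",
	       se.deadline - se.vruntime);
	printf("recomputing from the new slice would pull it in to %.0fms\n",
	       se.vruntime + se.slice);

	return 0;
}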

The schedstat patch for the base:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..fb5a35aa07ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4996,6 +4996,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
struct sched_entity *se;
s64 delta;

+ schedstat_inc(rq_of(cfs_rq)->tick_check_preempts);
+
/*
* When many tasks blow up the sched_period; it is possible that
* sched_slice() reports unusually large results (when many tasks are
@@ -5005,6 +5007,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)

delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
if (delta_exec > ideal_runtime) {
+ schedstat_inc(rq_of(cfs_rq)->tick_preempts);
resched_curr(rq_of(cfs_rq));
/*
* The current task ran long enough, ensure it doesn't get
@@ -5028,8 +5031,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
if (delta < 0)
return;

- if (delta > ideal_runtime)
+ if (delta > ideal_runtime) {
+ schedstat_inc(rq_of(cfs_rq)->tick_preempts);
resched_curr(rq_of(cfs_rq));
+ }
}

static void
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e93e006a942b..1bf12e271756 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1123,6 +1123,10 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ /* tick preempt stats */
+ unsigned int tick_check_preempts;
+ unsigned int tick_preempts;
#endif

#ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..7997b8538b72 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,13 @@ static int show_schedstat(struct seq_file *seq, void *v)

/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->tick_check_preempts, rq->tick_preempts);

seq_printf(seq, "\n");


The schedstat patch for eevdf:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cffec98724f3..675f4bbac471 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -975,18 +975,21 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
*/
static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ schedstat_inc(rq_of(cfs_rq)->check_deadline);
if ((s64)(se->vruntime - se->deadline) < 0)
return;

/*
* EEVDF: vd_i = ve_i + r_i / w_i
*/
+ schedstat_inc(rq_of(cfs_rq)->update_deadline);
se->deadline = se->vruntime + calc_delta_fair(se->slice, se);

/*
* The task has consumed its request, reschedule.
*/
if (cfs_rq->nr_running > 1) {
+ schedstat_inc(rq_of(cfs_rq)->preempt_deadline);
resched_curr(rq_of(cfs_rq));
clear_buddies(cfs_rq, se);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 93c2dc80143f..c44b59556367 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1129,6 +1129,11 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ /* update_deadline() stats */
+ unsigned int check_deadline;
+ unsigned int update_deadline;
+ unsigned int preempt_deadline;
#endif

#ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..2a8bd742507d 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,14 @@ static int show_schedstat(struct seq_file *seq, void *v)

/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->check_deadline, rq->update_deadline,
+ rq->preempt_deadline);

seq_printf(seq, "\n");
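
For completeness, this is roughly how the 1-minute deltas above can be
gathered on either patched kernel (a user-space sketch; the field order just
follows the seq_printf format strings in the patches, so the new counters are
the last two fields on each cpuN line with the base patch and the last three
with the eevdf one):

/*
 * Snapshot /proc/schedstat, sleep, snapshot again, and print the summed
 * per-CPU delta of every numeric field on the "cpuN" lines, in the order
 * show_schedstat() emits them.  Lines that don't start with "cpu"
 * (version, timestamp, domains) are skipped.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_FIELDS 16

static int snapshot(unsigned long long sum[MAX_FIELDS])
{
	FILE *f = fopen("/proc/schedstat", "r");
	char line[512];
	int nfields = 0;

	if (!f) {
		perror("/proc/schedstat");
		exit(1);
	}

	memset(sum, 0, MAX_FIELDS * sizeof(*sum));

	while (fgets(line, sizeof(line), f)) {
		char *tok = strtok(line, " \n");
		int i = 0;

		if (!tok || strncmp(tok, "cpu", 3))
			continue;

		while ((tok = strtok(NULL, " \n")) && i < MAX_FIELDS)
			sum[i++] += strtoull(tok, NULL, 10);

		if (i > nfields)
			nfields = i;
	}
	fclose(f);

	return nfields;
}

int main(int argc, char **argv)
{
	unsigned long long before[MAX_FIELDS], after[MAX_FIELDS];
	int secs = argc > 1 ? atoi(argv[1]) : 60;
	int i, nfields;

	snapshot(before);
	sleep(secs);
	nfields = snapshot(after);

	for (i = 0; i < nfields; i++)
		printf("field %2d: %llu\n", i, after[i] - before[i]);

	return 0;
}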


[0] https://lore.kernel.org/all/20230531115839.089944915@xxxxxxxxxxxxx/