Re: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression

From: Chen Yu
Date: Thu Aug 10 2023 - 22:42:37 EST


Hi Peter,

On 2023-08-11 at 09:11:21 +0800, Chen Yu wrote:
> On 2023-08-10 at 21:24:37 +0800, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a -34.8% regression of phoronix-test-suite.blogbench.Write.final_score on:
> >
> >
> > commit: e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13 ("sched/fair: Remove sched_feat(START_DEBIT)")
> > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/eevdf
> >
> > testcase: phoronix-test-suite
> > test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
> > parameters:
> >
> > test: blogbench-1.1.0
> > option_a: Write
> > cpufreq_governor: performance
> >
>
> It seems that commit e0c2ff903c32 not only removed sched_feat(START_DEBIT)
> for initial tasks, but also increased the vruntime for non-initial tasks:
> Before e0c2ff903c32, the vruntime for an enqueued task was:
> cfs_rq->min_vruntime
> After e0c2ff903c32, the vruntime for an enqueued task is:
> avg_vruntime(cfs_rq) = \Sum v_i * w_i / W, where W = \Sum w_i
> which, with equal weights, reduces to \Sum v_i / nr_tasks.
> This is usually higher than cfs_rq->min_vruntime, so we give less sleep bonus
> to the wakee, which could impact different workloads to different degrees.
> But since we later switched to lag-based placement, the lag is subtracted
> from this new vruntime, which could mitigate the problem.
>
>
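
To make the quoted point concrete: with three runnable tasks of equal weight
and vruntimes 100, 200 and 300, cfs_rq->min_vruntime is 100 while
avg_vruntime(cfs_rq) = (100 + 200 + 300) / 3 = 200, so (ignoring lag) a task
enqueued at the average starts 100 units of vruntime later than it would have
under min_vruntime-based placement. (The numbers are made up purely for
illustration.)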

Since lkp previously reported a hackbench regression with the eevdf policy
enabled, I did some experiments and found that with eevdf enabled there are
more preemptions, and these preemptions could slow down the waker (each waker
wakes up 20 wakees in hackbench). The reason might be that it is easier for
check_preempt_wakeup() to preempt the current task under eevdf:
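
For reference, the cfs-side gate that the pick_eevdf() check replaces is
wakeup_preempt_entity(), which only reports a preemption when curr's vruntime
is ahead of the wakee's by more than the wakeup granularity (abridged from
pre-eevdf kernel/sched/fair.c):

/*
 * Return 1 (preempt) only when curr's vruntime exceeds the wakee's by
 * more than wakeup_gran(); this margin is what damps wakeup preemption
 * in cfs.
 */
static int wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
        s64 gran, vdiff = curr->vruntime - se->vruntime;

        if (vdiff <= 0)
                return -1;

        gran = wakeup_gran(se);
        if (vdiff > gran)
                return 1;

        return 0;
}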

baseline (no eevdf): sched/fair: Add cfs_rq::avg_vruntime
compare (eevdf): sched/smp: Use lag to simplify cross-runqueue placement

hackbench -g 112 -f 20 --threads -l 30000 -s 100
on a system with 1 socket 56C/112T online

1. Use the following bpftrace script to track the preemption count
within 10 seconds:

/* A context switch where the previous task was still runnable
 * (TASK_RUNNING, i.e. 0) means it was preempted rather than having
 * gone to sleep. bpftrace does not define TASK_RUNNING, so use 0. */
tracepoint:sched:sched_switch
{
        if (args->prev_state == 0 /* TASK_RUNNING */) {
                @preempt = count();
        }
}

baseline:
bpftrace preemption.bt
Attaching 4 probes...
10:54:45 Preemption count:
@preempt: 2409638


compare:
bpftrace preemption.bt
Attaching 4 probes...
10:02:21 Preemption count:
@preempt: 12147709

There are many more preemptions with eevdf enabled.


2. Add the following schedstat counters to track the ratio of successful
preemptions in check_preempt_wakeup():

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57e8bc14b06e..dfd4a6ebdf23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8294,6 +8294,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
         int next_buddy_marked = 0;
         int cse_is_idle, pse_is_idle;
 
+        schedstat_inc(rq->check_preempt_count);
+
         if (unlikely(se == pse))
                 return;
 
@@ -8358,8 +8360,12 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
                 /*
                  * XXX pick_eevdf(cfs_rq) != se ?
                  */
-                if (pick_eevdf(cfs_rq) == pse)
+                if (pick_eevdf(cfs_rq) == pse) {
+                        if (se->vruntime <= pse->vruntime + sysctl_sched_wakeup_granularity)
+                                schedstat_inc(rq->low_gran_preempt_count);
+
                         goto preempt;
+                }
 
                 return;
         }
@@ -8377,6 +8383,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
         return;
 
 preempt:
+        schedstat_inc(rq->need_preempt_count);
+
         resched_curr(rq);
         /*
          * Only set the backward buddy when the current task is still
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aa5b293ca4ed..58abd3d53f1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1128,6 +1128,9 @@ struct rq {
         /* try_to_wake_up() stats */
         unsigned int            ttwu_count;
         unsigned int            ttwu_local;
+        unsigned int            check_preempt_count;
+        unsigned int            need_preempt_count;
+        unsigned int            low_gran_preempt_count;
 #endif
 
 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..99392cad0c07 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,14 @@ static int show_schedstat(struct seq_file *seq, void *v)
 
                 /* runqueue-specific stats */
                 seq_printf(seq,
-                    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+                    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u",
                     cpu, rq->yld_count,
                     rq->sched_count, rq->sched_goidle,
                     rq->ttwu_count, rq->ttwu_local,
                     rq->rq_cpu_time,
-                    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+                    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+                    rq->check_preempt_count, rq->need_preempt_count,
+                    rq->low_gran_preempt_count);
 
                 seq_printf(seq, "\n");
 
--
2.25.1
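
For convenience, here is a user-space sketch (a hypothetical helper, not part
of the patch) that parses the three counters the patch appends to each cpu
line of /proc/schedstat and computes the success ratio from a single
snapshot; the numbers below are deltas between two such snapshots taken 5
seconds apart:

#include <stdio.h>

int main(void)
{
        char line[512];
        FILE *fp = fopen("/proc/schedstat", "r");

        if (!fp)
                return 1;

        while (fgets(line, sizeof(line), fp)) {
                unsigned int check, need, low_gran;
                int cpu;

                /* skip the nine existing fields, then read the three new ones */
                if (sscanf(line, "cpu%d %*u %*u %*u %*u %*u %*u %*u %*u %*u %u %u %u",
                           &cpu, &check, &need, &low_gran) == 4 && check)
                        printf("cpu%d: preempt success ratio %.2f%% (low-gran %u)\n",
                               cpu, 100.0 * need / check, low_gran);
        }
        fclose(fp);
        return 0;
}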

Without eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
Thu Aug 10 11:02:02 2023 cpu8
.stats.check_preempt_count 51973 <-----
.stats.need_preempt_count 10514 <-----
.stats.rq_cpu_time 5004068598
.stats.rq_sched_info.pcount 60374
.stats.rq_sched_info.run_delay 80405664582
.stats.sched_count 60609
.stats.sched_goidle 227
.stats.ttwu_count 56250
.stats.ttwu_local 14619

The preemption success ratio is 10514 / 51973 = 20.23%
-----------------------------------------------------------------------------

With eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
Thu Aug 10 10:22:55 2023 cpu8
.stats.check_preempt_count 71673 <----
.stats.low_gran_preempt_count 57410
.stats.need_preempt_count 57413 <----
.stats.rq_cpu_time 5007778990
.stats.rq_sched_info.pcount 129233
.stats.rq_sched_info.run_delay 164830921362
.stats.sched_count 129233
.stats.ttwu_count 70222
.stats.ttwu_local 66847

The preemption success ratio is 57413 / 71673 = 80.10%

According to low_gran_preempt_count, most successful preemptions happen when
current->vruntime is smaller than wakee->vruntime + sysctl_sched_wakeup_granularity,
a case in which cfs's current wakeup_preempt_entity() would not preempt.

It seems that eevdf does not inhibit wakeup preemption as much as cfs does,
and maybe that is because eevdf needs to consider fairness more?
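
To make that concrete, here is a simplified two-task sketch of the eevdf
decision (hypothetical code, not the kernel's rbtree-based pick_eevdf()):
an entity is eligible when its vruntime is not past the weighted average
(i.e. it has non-negative lag), and among eligible entities the earliest
virtual deadline wins; note that no wakeup-granularity margin protects the
current task anywhere in this decision:

#include <stdbool.h>
#include <stdint.h>

struct entity {
        uint64_t vruntime;
        uint64_t deadline;      /* virtual deadline: vruntime + request/weight */
};

/* eligible: the entity's vruntime has not run ahead of the average */
static bool eligible(const struct entity *se, uint64_t avg_vruntime)
{
        return se->vruntime <= avg_vruntime;
}

/* the wakee preempts when it would win the pick over curr */
static bool wakee_wins(const struct entity *curr, const struct entity *wakee,
                       uint64_t avg_vruntime)
{
        if (!eligible(wakee, avg_vruntime))
                return false;
        if (!eligible(curr, avg_vruntime))
                return true;
        return wakee->deadline < curr->deadline;
}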

thanks,
Chenyu