Re: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression

From: Chen Yu
Date: Tue Aug 15 2023 - 23:49:10 EST


On 2023-08-14 at 15:29:35 +0200, Peter Zijlstra wrote:
> On Fri, Aug 11, 2023 at 10:42:09AM +0800, Chen Yu wrote:
>
> > Since lkp previously reported a hackbench regression with the eevdf
> > policy enabled, I ran some experiments and found that with eevdf enabled
> > there are more preemptions, and these preemptions can slow down the
> > waker (each waker wakes up 20 wakees in hackbench). The reason might be
> > that it is easier for check_preempt_wakeup() to preempt the current task
> > under eevdf:
>
> This is true.
>
> > Without eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
> > Thu Aug 10 11:02:02 2023 cpu8
> > .stats.check_preempt_count 51973 <-----
> > .stats.need_preempt_count 10514 <-----
> > .stats.rq_cpu_time 5004068598
> > .stats.rq_sched_info.pcount 60374
> > .stats.rq_sched_info.run_delay 80405664582
> > .stats.sched_count 60609
> > .stats.sched_goidle 227
> > .stats.ttwu_count 56250
> > .stats.ttwu_local 14619
> >
> > The preemption success ratio is 10514 / 51973 = 20.23%
> > -----------------------------------------------------------------------------
> >
> > With eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
> > Thu Aug 10 10:22:55 2023 cpu8
> > .stats.check_preempt_count 71673 <----
> > .stats.low_gran_preempt_count 57410
> > .stats.need_preempt_count 57413 <----
> > .stats.rq_cpu_time 5007778990
> > .stats.rq_sched_info.pcount 129233
> > .stats.rq_sched_info.run_delay 164830921362
> > .stats.sched_count 129233
> > .stats.ttwu_count 70222
> > .stats.ttwu_local 66847
> >
> > The preemption success ratio is 57413 / 71673 = 80.10%
>
> note: wakeup-preemption
>
> > According to low_gran_preempt_count, most successful preemptions happen
> > when current->vruntime is smaller than wakee->vruntime + sysctl_sched_wakeup_granularity,
> > a case in which the current cfs wakeup_preempt_entity() would not preempt.
> >
> > It seems that eevdf does not inhibit wakeup preemption as much as cfs
> > does; maybe that is because eevdf needs to consider fairness more?
>
> Not fairness, latency. Because it wants to honour the virtual deadline.
>
>
> Are these wakeup preemptions typically on runqueues that have only a
> single other task?
>

There are 112 groups of hackbench wakers/wakees on a 112-CPU system.
Each group has 1 waker and 20 wakees. With 4480 tasks in total (see the
hackbench output below), that is about 40 tasks per CPU, so there is
usually more than 1 other task on each runqueue. I collected the
statistics below.

> That is, consider a single task running, then avg_vruntime will be its
> vruntime, because the average of one variable must be the value of that
> one variable.
>
> Then the moment a second task joins, we get two options:
>
> - positive lag
> - negative lag
>
> When the new task has negative lag, it gets placed to the right of the
> currently running task (and avg_vruntime has a forward discontinuity).
> At this point the new task is not eligible and does not get to run.
>
> When the new task has positive lag, it gets placed to the left of the
> currently running task (and avg_vruntime has a backward discontinuity).
> At this point the currently running task is no longer eligible, and the
> new task must be selected -- irrespective of its deadline.
>
> The paper doesn't (AFAIR) consider the case of wake-up-preemption
> explicitly. It only considers task selection and vruntime placement.
>
> One option I suppose would be to gate the wakeup preemption by virtual
> deadline, only allowing it when the new task has an earlier deadline than
> the currently running one, and otherwise relying on tick preemption.
>
> NOTE: poking at wakeup preemption is a double edged sword, some
> workloads love it, some hate it. Touching it is bound to upset the
> balance -- again.
>
> (also, did I get that the right way around? -- I've got a Monday brain
> that isn't willing to boot properly)
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fe5be91c71c7..16d24e5dda8f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8047,6 +8047,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>  	cfs_rq = cfs_rq_of(se);
>  	update_curr(cfs_rq);
>  
> +	if (sched_feat(WAKEUP_DEADLINE)) {
> +		/*
> +		 * Only allow preemption if the virtual deadline of the new
> +		 * task is before the virtual deadline of the existing task.
> +		 */
> +		if (deadline_gt(deadline, pse, se))
> +			return;
> +	}
> +
>  	/*
>  	 * XXX pick_eevdf(cfs_rq) != se ?
>  	 */
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 61bcbf5e46a4..e733981b32aa 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -24,6 +24,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
>   * Allow wakeup-time preemption of the current task:
>   */
>  SCHED_FEAT(WAKEUP_PREEMPTION, true)
> +SCHED_FEAT(WAKEUP_DEADLINE, true)
>  
>  SCHED_FEAT(HRTICK, false)
>  SCHED_FEAT(HRTICK_DL, false)
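
To double check my understanding of the placement rule described above,
here is a minimal user-space sketch (toy values, weights ignored so
avg_vruntime is a plain average; illustration only, not kernel code):

#include <stdbool.h>
#include <stdio.h>

struct entity {
	long long vruntime;
	long long deadline;
};

/* lag = avg_vruntime - vruntime; eligible means lag >= 0 */
static bool eligible(long long avg_vruntime, const struct entity *se)
{
	return se->vruntime <= avg_vruntime;
}

int main(void)
{
	/* a single runner: avg_vruntime equals its own vruntime */
	struct entity curr = { .vruntime = 100, .deadline = 103 };
	struct entity wakee = { .vruntime = 90, .deadline = 120 };
	long long avg = curr.vruntime;

	printf("alone: curr eligible=%d\n", eligible(avg, &curr));

	/*
	 * A positive-lag wakee is placed left of curr and drags the
	 * average backwards: curr stops being eligible and the wakee
	 * must be picked, irrespective of its later deadline (120).
	 */
	avg = (curr.vruntime + wakee.vruntime) / 2;	/* 95 */
	printf("positive lag: curr=%d wakee=%d\n",
	       eligible(avg, &curr), eligible(avg, &wakee));

	/* A negative-lag wakee lands right of curr and is not eligible. */
	wakee.vruntime = 110;
	avg = (curr.vruntime + wakee.vruntime) / 2;	/* 105 */
	printf("negative lag: curr=%d wakee=%d\n",
	       eligible(avg, &curr), eligible(avg, &wakee));

	return 0;
}

So once the current task loses eligibility, a wakee can preempt it even
with a later deadline, which matches the counters below.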

Added more schedstats to track how often, when the wakee succeeds in
preempting the current task:

1. curr's deadline is smaller than the wakee's deadline
2. curr is not eligible
3. the cfs_rq->nr_running <= 2

Also applied Mike's RUN_TO_PARITY patch, with a minor change so that
RUN_TO_PARITY only takes effect for wakeup preemption:


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57e8bc14b06e..da7260ddd7e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -931,14 +931,20 @@ static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity
  *
  * Which allows an EDF like search on (sub)trees.
  */
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, int wake)
 {
 	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
 	struct sched_entity *curr = cfs_rq->curr;
 	struct sched_entity *best = NULL;
 
-	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
+	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) {
 		curr = NULL;
+		if (wake)
+			schedstat_inc(cfs_rq->rq->curr_noeli_preempt_count);
+	}
+
+	if (sched_feat(RUN_TO_PARITY) && wake && curr)
+		return curr;
 
 	while (node) {
 		struct sched_entity *se = __node_2_se(node);
@@ -5441,7 +5447,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
 			return cfs_rq->next;
 
-		return pick_eevdf(cfs_rq);
+		return pick_eevdf(cfs_rq, 0);
 	}
 
 	se = left = pick_cfs(cfs_rq, curr);
@@ -8294,6 +8300,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	int next_buddy_marked = 0;
 	int cse_is_idle, pse_is_idle;
 
+	schedstat_inc(rq->check_preempt_count);
+
 	if (unlikely(se == pse))
 		return;
 
@@ -8358,8 +8366,19 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		/*
 		 * XXX pick_eevdf(cfs_rq) != se ?
 		 */
-		if (pick_eevdf(cfs_rq) == pse)
+		if (pick_eevdf(cfs_rq, 1) == pse) {
+			if (cfs_rq->curr->vruntime <
+			    pse->vruntime + sysctl_sched_wakeup_granularity) {
+				schedstat_inc(rq->curr_lowgran_preempt_count);
+			}
+			if (cfs_rq->curr->deadline < pse->deadline) {
+				schedstat_inc(rq->curr_lowdl_preempt_count);
+			}
+			if (cfs_rq->nr_running <= 2) {
+				schedstat_inc(rq->curr_lownr_preempt_count);
+			}
 			goto preempt;
+		}
 
 		return;
 	}
@@ -8377,6 +8396,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;
 
 preempt:
+	schedstat_inc(rq->need_preempt_count);
+
 	resched_curr(rq);
 	/*
 	 * Only set the backward buddy when the current task is still
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 2a830eccda3e..d021dd0e0226 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -6,6 +6,7 @@
  */
 SCHED_FEAT(PLACE_LAG, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
+SCHED_FEAT(RUN_TO_PARITY, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aa5b293ca4ed..b6e245d2bba5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1128,6 +1128,12 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int ttwu_count;
 	unsigned int ttwu_local;
+	unsigned int check_preempt_count;
+	unsigned int need_preempt_count;
+	unsigned int curr_lowgran_preempt_count;
+	unsigned int curr_lowdl_preempt_count;
+	unsigned int curr_noeli_preempt_count;
+	unsigned int curr_lownr_preempt_count;
 #endif
 
 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..e7f02e7bfff2 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,16 @@ static int show_schedstat(struct seq_file *seq, void *v)

 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u %u %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq->check_preempt_count, rq->need_preempt_count, rq->curr_lowgran_preempt_count,
+		    rq->curr_lowdl_preempt_count,
+		    rq->curr_noeli_preempt_count,
+		    rq->curr_lownr_preempt_count);
 
 		seq_printf(seq, "\n");

--
2.25.1
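
The per-cpu numbers below are again the /proc/schedstat delta within 5
seconds on CPU8. A sketch of how such a delta can be collected is below;
this is a hypothetical helper written for illustration, not the script
actually used, and the field order follows the extended show_schedstat()
format above:

/* schedstat_delta.c - sample one cpu line of /proc/schedstat twice,
 * 5 seconds apart, and print the per-field deltas. With the patch
 * above, 15 numeric fields follow the "cpuN" tag. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NFIELDS 15

static int sample(const char *cpu, unsigned long long *v)
{
	char line[512];
	FILE *f = fopen("/proc/schedstat", "r");
	int ret = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		char *p = line + strlen(cpu);
		int i;

		if (strncmp(line, cpu, strlen(cpu)))
			continue;
		for (i = 0; i < NFIELDS; i++)
			v[i] = strtoull(p, &p, 10);
		ret = 0;
		break;
	}
	fclose(f);
	return ret;
}

int main(void)
{
	unsigned long long a[NFIELDS], b[NFIELDS];
	int i;

	if (sample("cpu8 ", a))
		return 1;
	sleep(5);
	if (sample("cpu8 ", b))
		return 1;
	for (i = 0; i < NFIELDS; i++)
		printf("field %2d delta %llu\n", i, b[i] - a[i]);
	return 0;
}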

NO_RUN_TO_PARITY:

Tue Aug 15 19:16:30 2023 cpu8
.stats.check_preempt_count 85754
.stats.curr_lowdl_preempt_count 5189
.stats.curr_lowgran_preempt_count 70063
.stats.curr_lownr_preempt_count 31
.stats.curr_noeli_preempt_count 30252
.stats.need_preempt_count 70066
.stats.rq_cpu_time 5032932195
.stats.rq_sched_info.pcount 155748
.stats.rq_sched_info.run_delay 159730175645
.stats.sched_count 155748
.stats.ttwu_count 84568
.stats.ttwu_local 81029


When the wakee succeeds in preempting the current task:

1. curr's deadline is smaller than wakee's deadline:
5189 / 70066 = 7.41%

2. curr is not eligible
30252 / 70066 = 43.18%

3. the cfs_rq->nr_running <= 2
31 / 70066 = 0.04%

According to the above data, comparing the deadlines may not have much
effect in this case (per 1), and the scenario with only 1 other task
seldom happens (per 3). Checking the current task's eligibility without
comparing deadlines might help (per 2): there are 70066 - 30252 = 39814
preemptions where the current task is still eligible, and Mike's patch
addresses exactly those.


RUN_TO_PARITY:

Tue Aug 15 19:19:07 2023 cpu8
.stats.check_preempt_count 20147
.stats.curr_lowdl_preempt_count 334
.stats.curr_lowgran_preempt_count 4637
.stats.curr_lownr_preempt_count 8
.stats.curr_noeli_preempt_count 5451
.stats.need_preempt_count 4645
.stats.rq_cpu_time 5052091287
.stats.rq_sched_info.pcount 25486
.stats.rq_sched_info.run_delay 166261574881
.stats.sched_count 25486
.stats.ttwu_count 20146
.stats.ttwu_local 20146

With Mike's change, the wakeup preemption success ratio drops to
4645 / 20147 = 23.06%, which is very close to the cfs wakeup preemption
ratio of 20.23%. And the throughput has increased:

echo NO_RUN_TO_PARITY > /sys/kernel/debug/sched/features
hackbench -g 112 -f 20 --threads -l 30000 -s 100
Running in threaded mode with 112 groups using 40 file descriptors each (== 4480 tasks)
Each sender will pass 30000 messages of 100 bytes
Time: 117.487

echo RUN_TO_PARITY > /sys/kernel/debug/sched/features
hackbench -g 112 -f 20 --threads -l 30000 -s 100
Running in threaded mode with 112 groups using 40 file descriptors each (== 4480 tasks)
Each sender will pass 30000 messages of 100 bytes
Time: 101.285 <--- lower is better
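
For reference, the 20.23% cfs ratio above comes from the pre-eevdf
wakeup_preempt_entity() gate in fair.c, which only signals preemption
once curr has run ahead of the wakee by more than the (weight-scaled)
wakeup granularity:

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;

	gran = wakeup_gran(se);
	if (vdiff > gran)
		return 1;

	return 0;
}

RUN_TO_PARITY similarly lets an eligible current task keep running,
which may explain why the two ratios end up so close.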

Also tested schbench with and without RUN_TO_PARITY; not much difference
is observed (especially in wakeup latency):

echo NO_RUN_TO_PARITY > /sys/kernel/debug/sched/features
./schbench -m 14 -t 16 -r 100
Wakeup Latencies percentiles (usec) runtime 100 (s) (941339 total samples)
50.0th: 4 (53382 samples)
90.0th: 3772 (353480 samples)
* 99.0th: 7448 (84115 samples)
99.9th: 7976 (8244 samples)
min=1, max=15186
Request Latencies percentiles (usec) runtime 100 (s) (941598 total samples)
50.0th: 23264 (283891 samples)
90.0th: 32672 (375363 samples)
* 99.0th: 52288 (83963 samples)
99.9th: 95360 (8449 samples)
min=4487, max=241201
RPS percentiles (requests) runtime 100 (s) (101 total samples)
20.0th: 9264 (22 samples)
* 50.0th: 9392 (33 samples)
90.0th: 9584 (36 samples)
min=9055, max=10176
average rps: 9415.98

echo RUN_TO_PARITY > /sys/kernel/debug/sched/features
./schbench -m 14 -t 16 -r 100
Wakeup Latencies percentiles (usec) runtime 100 (s) (943005 total samples)
50.0th: 4 (53961 samples)
90.0th: 3772 (366473 samples)
* 99.0th: 7496 (84902 samples)
99.9th: 9296 (8441 samples)
min=1, max=20467
Request Latencies percentiles (usec) runtime 100 (s) (943206 total samples)
50.0th: 23072 (283181 samples)
90.0th: 32096 (376473 samples)
* 99.0th: 53568 (84586 samples)
99.9th: 99456 (8443 samples)
min=4078, max=219330
RPS percentiles (requests) runtime 100 (s) (101 total samples)
20.0th: 9296 (22 samples)
* 50.0th: 9392 (34 samples)
90.0th: 9584 (38 samples)
min=9065, max=10188
average rps: 9432.06


thanks,
Chenyu