Re: [RFC PATCH 1/1] sched/eevdf: Skip eligibility check for current entity during wakeup preemption

From: Youssef Esmat
Date: Mon Mar 25 2024 - 18:26:05 EST


On Mon, Mar 25, 2024 at 1:03 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> With the curr entity's eligibility check, a wakeup preemption is very
> likely when an entity with positive lag joins the runqueue pushing the
> avg_vruntime of the runqueue backwards, making the vruntime of the
> current entity ineligible. This leads to aggressive wakeup preemption
> which was previously guarded by wakeup_granularity_ns in legacy CFS.
> The figure below depicts one such aggressive preemption scenario with EEVDF
> in DeathStarBench [1]:
>
>                                  deadline for Nginx
>                                           |
>      +-------+              |             |
> /--- | Nginx | -------------|-----------> |
> |    +-------+              |             |
> |                           |
> |     ----------------------|----------------> vruntime timeline
> |                           \--> rq->avg_vruntime
> |
> | wakes service on the same runqueue since system is busy
> |
> |    +---------+|
> \--> | Service || (service has +ve lag, pushes avg_vruntime backwards)
>      +---------+|
>                 |
>  wakeup         |    +--|-----+           |
>  preempts       \--> | N|ginx | --------> | {deadline for Nginx}
>                      +--|-----+           |
>                    (Nginx ineligible)
>       ------------------|--------------------> vruntime timeline
>                         \--> rq->avg_vruntime
>
> When the NGINX server is involuntarily switched out, it cannot accept any
> incoming requests, leading to longer turnaround times for the clients and
> thus a loss in DeathStarBench throughput.
>
> ==================================================================
> Test : DeathStarBench
> Units : Normalized latency
> Interpretation: Lower is better
> Statistic : Mean
> ==================================================================
> tip 1.00
> eevdf 1.14 (+14.61%)
>
> For the currently running task, skip the eligibility check in pick_eevdf()
> if it has not yet exhausted the slice promised to it at selection, even
> though the situation has changed since. The behavior is guarded by the
> RUN_TO_PARITY_WAKEUP sched_feat to simplify testing. With
> RUN_TO_PARITY_WAKEUP enabled, performance loss seen with DeathStarBench
> since the merge of EEVDF disappears. Following are the results from
> testing on a Dual Socket 3rd Generation EPYC server (2 x 64C/128T):
>
> ==================================================================
> Test : DeathStarBench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : Mean
> ==================================================================
> Pinning  scaling      tip     run-to-parity-wakeup (pct imp)
> 1CCD        1        1.00     1.16 (%diff: 16%)
> 2CCD        2        1.00     1.03 (%diff:  3%)
> 4CCD        4        1.00     1.12 (%diff: 12%)
> 8CCD        8        1.00     1.05 (%diff:  6%)
>
> With spec_rstack_overflow=off, the DeathStarBench performance with the
> proposed solution is same as the performance on v6.5 release before
> EEVDF was merged.

Thanks for sharing this, Prateek.
We actually noticed we could also gain performance by disabling the
eligibility checks (but disabling them on all paths).
The following are a few threads we had on the topic:

Discussion around eligibility:
https://lore.kernel.org/lkml/CA+q576MS0-MV1Oy-eecvmYpvNT3tqxD8syzrpxQ-Zk310hvRbw@xxxxxxxxxxxxxx/
Some of our results:
https://lore.kernel.org/lkml/CA+q576Mov1jpdfZhPBoy_hiVh3xSWuJjXdP3nS4zfpqfOXtq7Q@xxxxxxxxxxxxxx/
Sched feature to disable eligibility:
https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefesmat@chromium.org/

>
> This may lead to a newly waking task waiting longer for its turn on the
> CPU; however, testing on the same system did not reveal any consistent
> regressions with the standard benchmarks.
>
> Link: https://github.com/delimitrou/DeathStarBench/ [1]
> Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> ---
> kernel/sched/fair.c | 24 ++++++++++++++++++++----
> kernel/sched/features.h | 1 +
> 2 files changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6a16129f9a5c..a9b145a4eab0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -875,7 +875,7 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
> *
> * Which allows tree pruning through eligibility.
> */
> -static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, bool wakeup_preempt)
> {
> struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> struct sched_entity *se = __pick_first_entity(cfs_rq);
> @@ -889,7 +889,23 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> if (cfs_rq->nr_running == 1)
> return curr && curr->on_rq ? curr : se;
>
> - if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> + if (curr && !curr->on_rq)
> + curr = NULL;
> +
> + /*
> + * When an entity with positive lag wakes up, it pushes the
> + * avg_vruntime of the runqueue backwards. This may cause the
> + * current entity to become ineligible soon into its run, leading to
> + * wakeup preemption.
> + *
> + * To prevent such aggressive preemption of the current running
> + * entity during task wakeups, skip the eligibility check if the
> + * slice promised to the entity since its selection has not yet
> + * elapsed.
> + */
> + if (curr &&
> + !(sched_feat(RUN_TO_PARITY_WAKEUP) && wakeup_preempt && curr->vlag == curr->deadline) &&
> + !entity_eligible(cfs_rq, curr))
> curr = NULL;
>
> /*
> @@ -5460,7 +5476,7 @@ pick_next_entity(struct cfs_rq *cfs_rq)
> cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
> return cfs_rq->next;
>
> - return pick_eevdf(cfs_rq);
> + return pick_eevdf(cfs_rq, false);
> }
>
> static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> @@ -8340,7 +8356,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> /*
> * XXX pick_eevdf(cfs_rq) != se ?
> */
> - if (pick_eevdf(cfs_rq) == pse)
> + if (pick_eevdf(cfs_rq, true) == pse)
> goto preempt;
>
> return;
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 143f55df890b..027bab5b4031 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -7,6 +7,7 @@
> SCHED_FEAT(PLACE_LAG, true)
> SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
> SCHED_FEAT(RUN_TO_PARITY, true)
> +SCHED_FEAT(RUN_TO_PARITY_WAKEUP, true)
>
> /*
> * Prefer to schedule the task we woke last (assuming it failed
> --
> 2.34.1
>