Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

From: Vincent Guittot
Date: Wed Jul 19 2023 - 05:12:54 EST


On Wed, 19 Jul 2023 at 10:11, Aaron Lu <aaron.lu@xxxxxxxxx> wrote:
>
> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> > > Have you tried to remove update_cfs_group() from enqueue/dequeue and
> > > only let the tick update the share periodically ?
> >
> > patch4 kind of did that :-)
> >
>
> More about this.
>
> If I remove update_cfs_group() in dequeue_task_fair() on top of patch4
> like this:
>
> From 43d5c12f0b2180c99149e663a71c610e31023d90 Mon Sep 17 00:00:00 2001
> From: Aaron Lu <aaron.lu@xxxxxxxxx>
> Date: Wed, 19 Jul 2023 14:51:07 +0800
> Subject: [PATCH 1/2] sched/fair: completely remove update_cfs_group() in
> dequeue path
>
> ---
> kernel/sched/fair.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2adb6a6abbce..a21ab72819ce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6434,7 +6434,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> update_load_avg(cfs_rq, se, UPDATE_TG);
> se_update_runnable(se);
> - update_cfs_group(se);
>
> cfs_rq->h_nr_running--;
> cfs_rq->idle_h_nr_running -= idle_h_nr_running;
> --
> 2.40.1
>
> Then the P95 latency of the schbench workload described in patch4's
> changelog increases to > 1ms (base and patch4's P95 are < 100us):
>
> Latency percentiles (usec) runtime 300 (s) (18504 total samples)
> 50.0th: 20 (9537 samples)
> 75.0th: 25 (4869 samples)
> 90.0th: 29 (2264 samples)
> 95.0th: 2564 (909 samples)
> *99.0th: 20768 (740 samples)
> 99.5th: 23520 (93 samples)
> 99.9th: 31520 (74 samples)
> min=6, max=40072
>
> If I further remove update_cfs_group() completely in enqueue path on top
> of the last change:
>
> From 4e4cb31590ca2e4080ece9cfa9dfaaf26501c60d Mon Sep 17 00:00:00 2001
> From: Aaron Lu <aaron.lu@xxxxxxxxx>
> Date: Wed, 19 Jul 2023 15:36:24 +0800
> Subject: [PATCH 2/2] sched/fair: completely remove update_cfs_group() from
> enqueue path
>
> ---
> kernel/sched/fair.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a21ab72819ce..8fc325112282 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4847,8 +4847,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> */
> update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
> se_update_runnable(se);
> - if (cfs_rq->nr_running > 0)
> - update_cfs_group(se);
> account_entity_enqueue(cfs_rq, se);
>
> if (flags & ENQUEUE_WAKEUP)
> @@ -6344,7 +6342,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> update_load_avg(cfs_rq, se, UPDATE_TG);
> se_update_runnable(se);
> - update_cfs_group(se);
>
> cfs_rq->h_nr_running++;
> cfs_rq->idle_h_nr_running += idle_h_nr_running;
> --
> 2.40.1
>
> Then the P50 latency jumps from ~20us to ~4ms:
> Latency percentiles (usec) runtime 300 (s) (17940 total samples)
> 50.0th: 3996 (12092 samples)
> 75.0th: 4004 (4919 samples)
> 90.0th: 4004 (0 samples)
> 95.0th: 4012 (353 samples)
> *99.0th: 20000 (487 samples)
> 99.5th: 20000 (0 samples)
> 99.9th: 31136 (72 samples)
> min=7, max=37402
> real 5m36.633s
> user 47m33.947s
> sys 4m47.097s
>
> So for the read side, maybe just keep what patch4 does?

Yes, skipping update_cfs_group() at enqueue bypasses the opportunity to
increase the group's share and get more running time for the group until
the update really happens.

>
> Thanks,
> Aaron