Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

From: Vincent Guittot
Date: Wed Jul 19 2023 - 05:09:44 EST


On Wed, 19 Jul 2023 at 07:18, Aaron Lu <aaron.lu@xxxxxxxxx> wrote:
>
> On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> > On Tue, 18 Jul 2023 at 15:41, Aaron Lu <aaron.lu@xxxxxxxxx> wrote:
> > >
> > > When a workload involves many wake time task migrations, tg->load_avg
> > > can be heavily contended among CPUs because every migration involves
> > > removing the task's load from its src cfs_rq and attaching that load to
> > > its new cfs_rq. Both the remove and the attach require an update to
> > > tg->load_avg as well as propagating the change up the hierarchy.
> > >
> > > E.g. when running postgres_sysbench on a 2 sockets/112 cores/224 CPUs Intel
> > > Sapphire Rapids, during a 5s window, the wakeup number is 14 million and
> > > the migration number is 11 million. Since the workload can trigger many
> > > wakeups and migrations, the access (both read and write) to tg->load_avg
> > > can be unbounded. For the above workload, the profile shows
> > > update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With
> > > netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21 million and
> > > the migration number is 14 million; update_cfs_group() costs ~25% and
> > > update_load_avg() costs ~16%.
> > >
> > > This patch is an attempt to reduce the cost of accessing tg->load_avg.
> >
> > Here you mention tg->load_avg, which is updated with update_tg_load_avg(),
> >
> > but your patch below modifies the local update of the cfs_rq's util_avg,
> > runnable_avg and load_avg, which need to be kept up to date
>
> Yes, since it delays propagating the removed load, the upper cfs_rqs'
> *_avg could be updated later than in the current code.
>
> > You would be better off delaying or rate limiting the call to
> > update_tg_load_avg() or calc_group_shares()/update_cfs_group(), which
> > access tg->load_avg and are the culprit, instead of modifying other
> > places.
>
> Thanks for the suggestion and I think it makes perfect sense.
>
> I tried the below to rate limit the update to tg->load_avg to at most once
> per ms in update_tg_load_avg():
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e48fd0e6982d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> {
> long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> + u64 now = cfs_rq_clock_pelt(cfs_rq);
>
> /*
> * No need to update load_avg for root_task_group as it is not used.
> @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> if (cfs_rq->tg == &root_task_group)
> return;
>
> - if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> + if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
> + abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> atomic_long_add(delta, &cfs_rq->tg->load_avg);
> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> + cfs_rq->last_update_tg_load_avg = now;
> }
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 14dfaafb3a8f..b5201d44d820 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -594,6 +594,7 @@ struct cfs_rq {
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> unsigned long tg_load_avg_contrib;
> + u64 last_update_tg_load_avg;
> long propagate;
> long prop_runnable_sum;
>
> From some quick tests using postgres_sysbench and netperf/UDP_RR on SPR,
> this alone already delivers good results. For postgres_sysbench, the cost
> of the two functions dropped to 1%-2% each; and for netperf/UDP_RR, their
> costs dropped to ~2% and ~4%. Together with some more rate limiting on the
> read side, I think the end result will be better. Does the above look
> reasonable to you?

Yes, that seems a better way to limit write access
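
For reference, a minimal self-contained sketch of that write-side rate-limiting
pattern (plain C11, not kernel code; all names here are illustrative stand-ins,
only the 1 ms window and the contrib/64 threshold mirror the diff above):

#define _POSIX_C_SOURCE 199309L
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* shared, contended value (stand-in for tg->load_avg) */
static atomic_long shared_load_avg;

/* per-queue local state (stand-in for tg_load_avg_contrib & friends) */
struct local_state {
	long contrib;            /* last value folded into the shared atomic */
	uint64_t last_update_ns; /* time of the last fold */
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* publish the local change only if >1 ms has passed AND it exceeds ~1.5% */
static void maybe_update_shared(struct local_state *ls, long load_avg)
{
	long delta = load_avg - ls->contrib;
	uint64_t now = now_ns();

	if (now - ls->last_update_ns > 1000000 &&
	    labs(delta) > ls->contrib / 64) {
		atomic_fetch_add(&shared_load_avg, delta);
		ls->contrib = load_avg;
		ls->last_update_ns = now;
	}
}

int main(void)
{
	struct local_state ls = { 0, 0 };
	long last = 0;

	/* a burst of local load_avg changes, all well within one 1 ms window */
	for (long v = 100; v < 100000; v += 17) {
		maybe_update_shared(&ls, v);
		last = v;
	}

	/* only the first change was published; the rest were rate limited */
	printf("shared: %ld, last local: %ld\n",
	       (long)atomic_load(&shared_load_avg), last);
	return 0;
}

Compiled with e.g. gcc -std=gnu11, the loop finishes well within 1 ms, so only
the first change reaches the shared atomic; the rest stay local, which is the
write traffic the kernel change is trying to shed.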

>
> Alternatively, I can remove some callsites of update_tg_load_avg() like
> you suggested below and only call update_tg_load_avg() when the cfs_rq is
> decayed (really just decayed, not when it has detected removed load
> pending or load propagated from its children). This way it would give a
> similar result to the above (roughly once per ms).
>
> >
> > Have you tried to remove update_cfs_group() from enqueue/dequeue and
> > only let the tick update the shares periodically?
>
> patch 4 kind of did that :-)

That's what I noticed when reading patch 4 :-)

>
> >
> > Have you tried to make update_tg_load_avg() return early? The change
> > of the cfs_rq's load_avg is written to tg->load_avg only when the change
> > is bigger than ~1.5%. Maybe increase it to 6%?
>
> Yes, increasing the delta threshold is also a good way to limit the
> updates to tg->load_avg. Will try that too.
>
> >
> > Or, like for update_cfs_group(), remove it from attach/detach entity and
> > let the periodic update propagate the change.
> >
> > But please don't touch the local update of *_avg
>
> OK I see.
>
> Thanks again for the comments; they are very helpful.
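
As a side note on the threshold arithmetic discussed above: the existing
check abs(delta) > cfs_rq->tg_load_avg_contrib / 64 fires when the delta
exceeds roughly 1.56% of the last published contribution, so relaxing it
to about 6% would roughly amount to dividing by 16 instead (contrib / 16
is 6.25%).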