Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

From: Vincent Guittot
Date: Tue Jul 18 2023 - 12:02:20 EST


On Tue, 18 Jul 2023 at 15:41, Aaron Lu <aaron.lu@xxxxxxxxx> wrote:
>
> When a workload involves many wake time task migrations, tg->load_avg
> can be heavily contended among CPUs because every migration involves
> removing the task's load from its src cfs_rq and attaching that load to
> its new cfs_rq. Both the remove and the attach require an update to
> tg->load_avg as well as propagating the change up the hierarchy.
>
> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> Sapphire Rapids, during a 5s window, the wakeup number is 14 million and
> migration number is 11 million. Since the workload can trigger many
> wakeups and migrations, the accesses (both read and write) to
> tg->load_avg can be unbounded. For the above workload, the profile shows
> update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With
> netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21 million and
> migration number is 14 million; update_cfs_group() costs ~25% and
> update_load_avg() costs ~16%.
>
> This patch is an attempt to reduce the cost of accessing tg->load_avg.

Here you mention tg->load_avg, which is updated by update_tg_load_avg(),

but your patch below modifies the local update of the cfs_rq's util_avg,
runnable_avg and load_avg, which need to be kept up to date.

You would be better off delaying or rate limiting the calls to
update_tg_load_avg() or calc_group_shares()/update_cfs_group(), which
access tg->load_avg and are the real culprits, instead of modifying
other places.
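
To illustrate what I mean by rate limiting, here is a rough user-space
sketch, not kernel code. The struct layouts, the last_tg_update_ns field
and the 1 ms interval are all assumptions made up for the example. The
local load_avg stays exact; only the write to the shared tg->load_avg is
skipped when the previous write was too recent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared, contended tg->load_avg. */
struct tg_model {
	long load_avg;
};

/* Hypothetical stand-in for the per-CPU cfs_rq fields involved. */
struct cfs_rq_model {
	struct tg_model *tg;
	long load_avg;              /* local average, always kept up to date */
	long tg_load_avg_contrib;   /* last value folded into tg->load_avg */
	uint64_t last_tg_update_ns; /* assumed field: time of the last fold */
};

/* Assumed interval: at most one write to tg->load_avg per millisecond. */
#define TG_UPDATE_INTERVAL_NS 1000000ULL

/* Returns true only if the shared tg->load_avg was actually written. */
static bool update_tg_load_avg_ratelimited(struct cfs_rq_model *cfs_rq,
					   uint64_t now_ns)
{
	long delta;

	assert(cfs_rq && cfs_rq->tg);

	/* Too soon since the last fold: leave the shared counter alone. */
	if (now_ns - cfs_rq->last_tg_update_ns < TG_UPDATE_INTERVAL_NS)
		return false;

	delta = cfs_rq->load_avg - cfs_rq->tg_load_avg_contrib;
	if (!delta)
		return false;

	cfs_rq->tg->load_avg += delta;
	cfs_rq->tg_load_avg_contrib = cfs_rq->load_avg;
	cfs_rq->last_tg_update_ns = now_ns;
	return true;
}
```

The pending delta is never lost: it is recomputed from load_avg and
tg_load_avg_contrib on the next call that passes the interval check.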

Have you tried removing update_cfs_group() from enqueue/dequeue and
letting only the tick update the shares periodically?

Have you tried making update_tg_load_avg() return early? The change
in the cfs_rq's load_avg is written to tg->load_avg only when it is
bigger than 1.5%. Maybe increase that to 6%?
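
As a rough user-space model of that early return (a sketch under
assumptions, not the kernel function): the struct and the divisor
parameter below are made up for illustration. A divisor of 64 gives a
threshold of about 1.5% of the last contribution, and 16 gives about 6%.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of the fields involved; names are illustrative. */
struct tg_contrib_model {
	long tg_load_avg;         /* stands in for the shared tg->load_avg */
	long load_avg;            /* local cfs_rq load_avg */
	long tg_load_avg_contrib; /* last value folded into tg_load_avg */
};

/*
 * Fold the local change into the shared sum only when it exceeds
 * contrib / divisor. Returns true if the shared value was written.
 */
static bool fold_into_tg(struct tg_contrib_model *m, long divisor)
{
	long delta = m->load_avg - m->tg_load_avg_contrib;

	assert(divisor > 0);

	/* Small change: return early and skip the contended write. */
	if (labs(delta) <= m->tg_load_avg_contrib / divisor)
		return false;

	m->tg_load_avg += delta;
	m->tg_load_avg_contrib = m->load_avg;
	return true;
}
```

With a contribution of 1000, a 3% change (load_avg = 1030) is written
when the divisor is 64 but skipped when it is 16, so a larger threshold
trades accuracy of tg->load_avg for fewer writes.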

Or, as with update_cfs_group(), remove it from attach/detach entity and
let the periodic update propagate the change.

But please don't touch the local update of *_avg.

>
> Current logic will immediately do an update_tg_load_avg() if cfs_rq has
> removed load; this patch changes this behavior: if this cfs_rq has
> removed load as discovered by update_cfs_rq_load_avg(), it does not call
> update_tg_load_avg() or propagate the removed load immediately. Instead,
> the update to tg->load_avg and the propagated load can be dealt with by
> a following event, like a task being attached to this cfs_rq, or in
> update_blocked_averages(). This way, the calls to update_tg_load_avg()
> for this cfs_rq and its ancestors can be reduced by about half.
>
> ================================================
> postgres_sysbench(transaction, higher is better)
> nr_thread=100%/75%/50% were tested on 2 sockets SPR and Icelake and
> results that have a measurable difference are:
>
> nr_thread=100% on SPR:
> base: 90569.11±1.15%
> node: 104152.26±0.34% +15.0%
> delay: 127309.46±4.25% +40.6%
>
> nr_thread=75% on SPR:
> base: 100803.96±0.57%
> node: 107333.58±0.44% +6.5%
> delay: 124332.39±0.51% +23.3%
>
> nr_thread=75% on ICL:
> base: 61961.26±0.41%
> node: 61585.45±0.50%
> delay: 72420.52±0.14% +16.9%
>
> =======================================================================
> hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
> group=1/4/8/16 were tested on 2 sockets SPR and Cascade Lake and the
> results that have a measurable difference are:
>
> group=8 on SPR:
> base: 437163±2.6%
> node: 471203±1.2% +7.8%
> delay: 490780±0.9% +12.3%
>
> group=16 on SPR:
> base: 468279±1.9%
> node: 580385±1.7% +23.9%
> delay: 664422±0.2% +41.9%
>
> ================================================
> netperf/TCP_STREAM (throughput, higher is better)
> nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
> Lake and results that have a measurable difference are:
>
> nr_thread=50% on CSL:
> base: 16258±0.7%
> node: 16172±2.9%
> delay: 17729±0.7% +9.0%
>
> nr_thread=75% on CSL:
> base: 12923±1.2%
> node: 13011±2.2%
> delay: 15452±1.6% +19.6%
>
> nr_thread=75% on SPR:
> base: 16232±11.9%
> node: 13962±5.1%
> delay: 21089±0.8% +29.9%
>
> nr_thread=100% on SPR:
> base: 13220±0.6%
> node: 13113±0.0%
> delay: 18258±11.3% +38.1%
>
> =============================================
> netperf/UDP_RR (throughput, higher is better)
> nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
> Lake and results that have a measurable difference are:
>
> nr_thread=1 on CSL:
> base: 128521±0.5%
> node: 127935±0.6%
> delay: 126317±0.4% -1.7%
>
> nr_thread=75% on CSL:
> base: 36701±1.7%
> node: 39949±1.4% +8.8%
> delay: 42516±0.3% +15.8%
>
> nr_thread=75% on SPR:
> base: 14249±3.8%
> node: 19890±2.0% +39.6%
> delay: 31331±0.5% +119.9%
>
> nr_thread=100% on CSL:
> base: 52275±0.6%
> node: 53827±0.4% +3.0%
> delay: 78386±0.7% +49.9%
>
> nr_thread=100% on SPR:
> base: 9560±1.6%
> node: 14186±3.9% +48.4%
> delay: 20779±2.8% +117.4%
>
> Signed-off-by: Aaron Lu <aaron.lu@xxxxxxxxx>
> ---
> kernel/sched/fair.c | 23 ++++++++++++++++++-----
> kernel/sched/sched.h | 1 +
> 2 files changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aceb8f5922cb..564ffe3e59c1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3645,6 +3645,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> if (child_cfs_rq_on_list(cfs_rq))
> return false;
>
> + if (cfs_rq->prop_removed_sum)
> + return false;
> +
> return true;
> }
>
> @@ -3911,6 +3914,11 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> {
> cfs_rq->propagate = 1;
> cfs_rq->prop_runnable_sum += runnable_sum;
> +
> + if (cfs_rq->prop_removed_sum) {
> + cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
> + cfs_rq->prop_removed_sum = 0;
> + }
> }
>
> /* Update task and its cfs_rq load average */
> @@ -4133,13 +4141,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> * removed_runnable is the unweighted version of removed_load so we
> * can use it to estimate removed_load_sum.
> */
> - add_tg_cfs_propagate(cfs_rq,
> - -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> -
> - decayed = 1;
> + cfs_rq->prop_removed_sum +=
> + -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;
> }
>
> - decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> + decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> u64_u32_store_copy(sa->last_update_time,
> cfs_rq->last_update_time_copy,
> sa->last_update_time);
> @@ -9001,6 +9007,13 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
>
> if (cfs_rq == &rq->cfs)
> decayed = true;
> +
> + /*
> + * If the aggregated removed_sum hasn't been taken care of,
> + * deal with it now before this cfs_rq is removed from the list.
> + */
> + if (cfs_rq->prop_removed_sum)
> + add_tg_cfs_propagate(cfs_rq, 0);
> }
>
> /* Propagate pending load changes to the parent, if any: */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9cece2dbc95b..ab540b21d071 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -619,6 +619,7 @@ struct cfs_rq {
> unsigned long tg_load_avg_contrib;
> long propagate;
> long prop_runnable_sum;
> + long prop_removed_sum;
>
> /*
> * h_load = weight * f(tg)
> --
> 2.41.0
>