Re: [RFC PATCH 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature

From: Chen Yu
Date: Thu Oct 19 2023 - 07:36:03 EST


On 2023-10-18 at 16:45:10 -0400, Mathieu Desnoyers wrote:
> Introduce the UTIL_FITS_CAPACITY scheduler feature. The runqueue
> selection picks the previous, target, or recent runqueues if they have
> enough remaining capacity to enqueue the task before scanning for an
> idle cpu.
>
> This feature is introduced in preparation for the SELECT_BIAS_PREV
> scheduler feature. Its performance benefits are noticeable when combined
> with the SELECT_BIAS_PREV feature.
>
> The following benchmarks only cover the UTIL_FITS_CAPACITY feature.
> Those are performed on a v6.5.5 kernel with mitigations=off.
>
> The following hackbench workload on a 2-socket AMD EPYC 9654 96-Core
> Processor (192 cores total) keeps roughly the same wall time (49s).
>
> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>
> We can observe that the number of migrations is reduced significantly
> with this patch (improvement):
>
> Baseline: 117M cpu-migrations (9.355 K/sec)
> With patch: 67M cpu-migrations (5.470 K/sec)
>
> The task-clock utilization is reduced (degradation):
>
> Baseline: 253.275 CPUs utilized
> With patch: 223.130 CPUs utilized
>
> The number of context-switches is increased (degradation):
>
> Baseline: 445M context-switches (35.516 K/sec)
> With patch: 581M context-switches (47.548 K/sec)
>
> So the improvement due to reduction of migrations is countered by the
> degradation in CPU utilization and context-switches. The following
> SELECT_BIAS_PREV feature will address this.
>
> Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@xxxxxxx
> Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@xxxxxxx/
> Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@xxxxxxxxx/
> Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@xxxxxxxxx/
> Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Link: https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Link: https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Cc: Ben Segall <bsegall@xxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
> Cc: Swapnil Sapkal <Swapnil.Sapkal@xxxxxxx>
> Cc: Aaron Lu <aaron.lu@xxxxxxxxx>
> Cc: Chen Yu <yu.c.chen@xxxxxxxxx>
> Cc: Tim Chen <tim.c.chen@xxxxxxxxx>
> Cc: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> Cc: Gautham R . Shenoy <gautham.shenoy@xxxxxxx>
> Cc: x86@xxxxxxxxxx
> ---
> kernel/sched/fair.c | 49 ++++++++++++++++++++++++++++++++++++-----
> kernel/sched/features.h | 6 +++++
> kernel/sched/sched.h | 5 +++++
> 3 files changed, 54 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d9c2482c5a3..8058058afb11 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4497,6 +4497,37 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> trace_sched_util_est_se_tp(&p->se);
> }
>
> +/*
> + * Returns true if adding the task utilization to the estimated
> + * utilization of the runnable tasks on @cpu does not exceed the
> + * capacity of @cpu.
> + *
> + * This considers only the utilization of _runnable_ tasks on the @cpu
> + * runqueue, excluding blocked and sleeping tasks. This is achieved by
> + * using the runqueue util_est.enqueued, and by estimating the capacity
> + * of @cpu based on arch_scale_cpu_capacity and arch_scale_thermal_pressure
> + * rather than capacity_of() because capacity_of() considers
> + * blocked/sleeping tasks in other scheduler classes.
> + *
> + * The utilization vs capacity comparison is done without the margin
> + * provided by fits_capacity(), because fits_capacity() is used to
> + * validate whether the utilization of a task fits within the overall
> + * capacity of a cpu, whereas this function validates whether the task
> + * utilization fits within the _remaining_ capacity of the cpu, which is
> + * more precise.
> + */
> +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
> + int cpu)
> +{
> + unsigned long total_util, capacity;
> +
> + if (!sched_util_fits_capacity_active())
> + return false;
> + total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
> + capacity = arch_scale_cpu_capacity(cpu) - arch_scale_thermal_pressure(cpu);

scale_rt_capacity(cpu) could provide the remaining cpu capacity after subtracting
the side activity (rt tasks, thermal pressure, irq time); maybe it would be more accurate?
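
For illustration, roughly what I had in mind (untested sketch, only meant to
show the substitution; scale_rt_capacity() is a static helper defined further
down in fair.c, so it would need a forward declaration or a move):

static unsigned long scale_rt_capacity(int cpu);

static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
                                                    int cpu)
{
        unsigned long total_util;

        if (!sched_util_fits_capacity_active())
                return false;
        total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
        /*
         * scale_rt_capacity() already subtracts rt/dl utilization, irq time
         * and thermal pressure from arch_scale_cpu_capacity().
         */
        return total_util <= scale_rt_capacity(cpu);
}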

> + return total_util <= capacity;
> +}
> +
> static inline int util_fits_cpu(unsigned long util,
> unsigned long uclamp_min,
> unsigned long uclamp_max,
> @@ -7124,12 +7155,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> int i, recent_used_cpu;
>
> /*
> - * On asymmetric system, update task utilization because we will check
> - * that the task fits with cpu's capacity.
> + * With the UTIL_FITS_CAPACITY feature and on asymmetric system,
> + * update task utilization because we will check that the task
> + * fits with cpu's capacity.
> */
> - if (sched_asym_cpucap_active()) {
> + if (sched_util_fits_capacity_active() || sched_asym_cpucap_active()) {
> sync_entity_load_avg(&p->se);
> task_util = task_util_est(p);
> + }
> + if (sched_asym_cpucap_active()) {
> util_min = uclamp_eff_value(p, UCLAMP_MIN);
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
> }
> @@ -7139,7 +7173,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> */
> lockdep_assert_irqs_disabled();
>
> - if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
> + if ((available_idle_cpu(target) || sched_idle_cpu(target) ||
> + task_fits_remaining_cpu_capacity(task_util, target)) &&

Compared to the previous version posted here[1], which chose the previous CPU
only when its util_est was below 25% of the CPU capacity, the current version
seems more aggressive: it is possible that a short-running task gets queued on
a nearly 100% busy cpu while there is still an idle cpu in the system.

[1] https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@xxxxxxxxxxxx/
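
To illustrate the difference (hand-written sketch; the helper names and the
exact form of the earlier check are mine, not taken from either patch): the
earlier version only kept the task on the previous CPU when that CPU was
mostly idle, while the current check accepts any CPU whose remaining capacity
still fits the task, even one that is already close to fully busy:

/* Roughly the earlier policy: pick prev only if it is mostly idle. */
static inline bool cpu_mostly_idle(int cpu)
{
        return READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) <
               arch_scale_cpu_capacity(cpu) / 4;       /* 25% threshold */
}

/* Current policy: accept prev as long as the extra utilization still fits. */
static inline bool cpu_still_fits(unsigned long task_util, int cpu)
{
        unsigned long capacity = arch_scale_cpu_capacity(cpu) -
                                 arch_scale_thermal_pressure(cpu);

        return READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util <=
               capacity;
}

With the second check a small task can still land on a CPU that is already
~90% busy, which is the case described above.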

thanks,
Chenyu