Re: [RFC PATCH] sched/fair: Skip idle CPU search on busy system

From: Vishal Chourasia
Date: Wed Aug 09 2023 - 14:44:38 EST


On Wed, Jul 26, 2023 at 03:06:12PM +0530, Shrikanth Hegde wrote:
> When the system is fully busy, there will not be any idle CPUs.
> In that case, load_balance will be called mainly with the CPU_NOT_IDLE
> type. should_we_balance currently checks for an idle CPU if one
> exists. When the system is 100% busy, there will not be an idle CPU,
> so these idle_cpu checks can be skipped. This avoids fetching those
> rq structures.
>
> This is a minor optimization for a specific case of 100% utilization.
>
> .....
> Coming to the current implementation: it is a very basic approach to
> the issue and may not be the best/perfect way to do this. It works
> only with CONFIG_NO_HZ_COMMON. nohz.nr_cpus is a global counter which
> tracks idle CPUs; AFAIU there isn't any other such info, but if there
> is, we can use that instead. nohz.nr_cpus is atomic, which might be
> costly too.
>
> An alternative would be to add a new attribute to sched_domain and
> update it in the CPU idle entry/exit path, per CPU. The advantage is
> that the check can be per env->sd instead of global. Slightly more
> complicated, but maybe better; there could also be an advantage at
> wakeup, to limit the scan, etc.
>
> Your feedback would really help. Does this optimization make sense?
>
> Signed-off-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..903d59b5290c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
> return 1;
> }
>
> +#ifdef CONFIG_NO_HZ_COMMON
> + /* If the system is fully busy, its better to skip the idle checks */
> + if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
> + return group_balance_cpu(sg) == env->dst_cpu;
> +#endif
> +
> /* Try to find first idle CPU */
> for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
> if (!idle_cpu(cpu))
> --
> 2.31.1
>
Tested this patchset on top of v6.4.

5 runs of stress-ng (100% load) on a system with 16 CPUs, spawning 23 threads
for 60 minutes.

stress-ng: 16 CPUs, 23 threads, 60 mins

- 6.4.0

| completion time (sec) | user (sec) | sys (sec)  |
|-----------------------+------------+------------|
| 3600.05               | 57582.44   | 0.70       |
| 3600.10               | 57597.07   | 0.68       |
| 3600.05               | 57596.65   | 0.47       |
| 3600.04               | 57596.36   | 0.71       |
| 3600.06               | 57595.32   | 0.42       |
| 3600.06               | 57593.568  | 0.596      | average
| 0.046904158           | 12.508392  | 0.27878307 | stddev

- 6.4.0+ (with patch)

| completion time (sec) | user (sec) | sys (sec)   |
|-----------------------+------------+-------------|
| 3600.04               | 57596.58   | 0.50        |
| 3600.04               | 57595.19   | 0.48        |
| 3600.05               | 57597.39   | 0.49        |
| 3600.04               | 57596.64   | 0.53        |
| 3600.04               | 57595.94   | 0.43        |
| 3600.042              | 57596.348  | 0.486       | average
| 0.0089442719          | 1.6529610  | 0.072938330 | stddev

The average system time is slightly lower in the patched version (0.486 seconds)
compared to the 6.4.0 version (0.596 seconds).
The standard deviation for system time is also lower in the patched version
(0.0729 seconds) than in the 6.4.0 version (0.2788 seconds), suggesting more
consistent system time results with the patch.

vishal.c