Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target

From: Vincent Guittot
Date: Tue Dec 08 2020 - 11:15:32 EST


On Tue, 8 Dec 2020 at 16:35, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> After select_idle_sibling, p->recent_used_cpu is set to the
> new target. However on the next wakeup, prev will be the same as
> recent_used_cpu unless the load balancer has moved the task since the last
> wakeup. It still works, but is less efficient than it could be given
> all the changes merged since then, such as those that reduce unnecessary
> migrations and the load balancer changes. This patch preserves
> recent_used_cpu for longer.
>
> With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled:
>
>                        5.10.0-rc6            5.10.0-rc6
>                       baseline-v2          altrecent-v2
> Hmean     1      508.39 (  0.00%)      502.05 * -1.25%*
> Hmean     2      986.70 (  0.00%)      983.65 * -0.31%*
> Hmean     4     1914.55 (  0.00%)     1920.24 *  0.30%*
> Hmean     8     3702.37 (  0.00%)     3663.96 * -1.04%*
> Hmean    16     6573.11 (  0.00%)     6545.58 * -0.42%*
> Hmean    32    10142.57 (  0.00%)    10253.73 *  1.10%*
> Hmean    64    14348.40 (  0.00%)    12506.31 *-12.84%*
> Hmean   128    21842.59 (  0.00%)    21967.13 *  0.57%*
> Hmean   256    20813.75 (  0.00%)    21534.52 *  3.46%*
> Hmean   320    20684.33 (  0.00%)    21070.14 *  1.87%*
>
> The difference was marginal except for 64 threads, where the baseline
> result was very unstable whereas the patched result was much more
> stable. This is somewhat machine-specific: on a separate 80-CPU
> Broadwell machine the same test reported:
>
>                        5.10.0-rc6            5.10.0-rc6
>                       baseline-v2          altrecent-v2
> Hmean     1      310.36 (  0.00%)      291.81 * -5.98%*
> Hmean     2      340.86 (  0.00%)      547.22 * 60.54%*
> Hmean     4      912.29 (  0.00%)     1063.21 * 16.54%*
> Hmean     8     2116.40 (  0.00%)     2103.60 * -0.60%*
> Hmean    16     4232.90 (  0.00%)     4362.92 *  3.07%*
> Hmean    32     8442.03 (  0.00%)     8642.10 *  2.37%*
> Hmean    64    11733.91 (  0.00%)    11473.66 * -2.22%*
> Hmean   128    17727.24 (  0.00%)    16784.23 * -5.32%*
> Hmean   256    16089.23 (  0.00%)    16110.79 *  0.13%*
> Hmean   320    15992.60 (  0.00%)    16071.64 *  0.49%*
>
> schedstats were not used in this series, but from an earlier debugging
> effort, the schedstats after the test run were as follows:
>
> Ops SIS Search                 5653107942.00    5726545742.00
> Ops SIS Domain Search          3365067916.00    3319768543.00
> Ops SIS Scanned              112173512543.00   99194352541.00
> Ops SIS Domain Scanned       109885472517.00   96787575342.00
> Ops SIS Failures               2923185114.00    2950166441.00
> Ops SIS Recent Used Hit             56547.00     118064916.00
> Ops SIS Recent Used Miss       1590899250.00     354942791.00
> Ops SIS Recent Attempts        1590955797.00     473007707.00
> Ops SIS Search Efficiency               5.04             5.77
> Ops SIS Domain Search Eff               3.06             3.43
> Ops SIS Fast Success Rate              40.47            42.03
> Ops SIS Success Rate                   48.29            48.48
> Ops SIS Recent Success Rate             0.00            24.96
>
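
The derived rows at the bottom of the table above can be reproduced from the
raw counters. The sketch below is only an illustration: the formulas are
inferred from the printed numbers rather than taken from the tooling that
generated them, the struct and function names are hypothetical, and it assumes
the first column is the baseline kernel and the second is the patched one.

```c
#include <assert.h>

/* Raw counters from one column of the table (hypothetical field names). */
struct sis_counters {
	unsigned long long search;
	unsigned long long domain_search;
	unsigned long long scanned;
	unsigned long long domain_scanned;
	unsigned long long failures;
	unsigned long long recent_hit;
	unsigned long long recent_miss;
	unsigned long long recent_attempts;
};

/* 100 * part / whole, or 0 when there is nothing to divide by. */
static double pct(unsigned long long part, unsigned long long whole)
{
	return whole ? 100.0 * (double)part / (double)whole : 0.0;
}

/* Inferred: searches per runqueue scanned, as a percentage. */
static double search_efficiency(const struct sis_counters *c)
{
	return pct(c->search, c->scanned);
}

static double domain_search_eff(const struct sis_counters *c)
{
	return pct(c->domain_search, c->domain_scanned);
}

/* Inferred: searches that never needed a domain scan. */
static double fast_success_rate(const struct sis_counters *c)
{
	return pct(c->search - c->domain_search, c->search);
}

static double success_rate(const struct sis_counters *c)
{
	return pct(c->search - c->failures, c->search);
}

static double recent_success_rate(const struct sis_counters *c)
{
	/* The table is self-consistent: attempts = hits + misses. */
	assert(c->recent_hit + c->recent_miss == c->recent_attempts);
	return pct(c->recent_hit, c->recent_attempts);
}

/* First column of the table (assumed to be the baseline kernel). */
static const struct sis_counters baseline = {
	5653107942ULL, 3365067916ULL, 112173512543ULL, 109885472517ULL,
	2923185114ULL, 56547ULL, 1590899250ULL, 1590955797ULL,
};

/* Second column of the table (assumed to be the patched kernel). */
static const struct sis_counters patched = {
	5726545742ULL, 3319768543ULL, 99194352541ULL, 96787575342ULL,
	2950166441ULL, 118064916ULL, 354942791ULL, 473007707ULL,
};
```

With these inputs, recent_success_rate(&patched) comes out at roughly 24.96,
matching the last row, and 118064916 / 56547 is about 2088, which is where the
"over 2000 times" figure comes from.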
> The first interesting point is the ridiculous number of times runqueues
> are scanned -- almost 97 billion times over the course of 40 minutes.
>
> With the patch, "Recent Used Hit" is over 2000 times more likely to
> succeed. The failure rate also increases by quite a lot but the cost is
> marginal even if the "Fast Success Rate" only increases by 2% overall. What
> cannot be observed from these stats is where the biggest impact is, as
> these stats cover everything from low utilisation to over-saturation.
>
> If graphed over time, the graphs show that the sched domain is only
> scanned at negligible rates until the machine is fully busy. With
> low utilisation, the "Fast Success Rate" is almost 100% until the
> machine is fully busy. For 320 clients, the success rate is close to
> 0% which is unsurprising.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

Reviewed-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>

> ---
> kernel/sched/fair.c | 9 +--------
> 1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c41875aec23..413d895bbbf8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>
>  	/* Check a recently used CPU as a potential idle candidate: */
>  	recent_used_cpu = p->recent_used_cpu;
> +	p->recent_used_cpu = prev;
>  	if (recent_used_cpu != prev &&
>  	    recent_used_cpu != target &&
>  	    cpus_share_cache(recent_used_cpu, target) &&
>  	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
>  	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
>  	    asym_fits_capacity(task_util, recent_used_cpu)) {
> -		/*
> -		 * Replace recent_used_cpu with prev as it is a potential
> -		 * candidate for the next wake:
> -		 */
> -		p->recent_used_cpu = prev;
>  		return recent_used_cpu;
>  	}
>
> @@ -6768,9 +6764,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
>  		/* Fast path */
>  		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> -
> -		if (want_affine)
> -			current->recent_used_cpu = cpu;
>  	}
>  	rcu_read_unlock();
>
> --
> 2.26.2
>
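
For readers who want to see the new ordering in isolation, here is a minimal
userspace sketch of the patched check. It is a toy model, not the kernel code:
struct task_model, NR_CPUS_MODEL, pick_recent_used_cpu() and the idle[] array
are all hypothetical, and the cache-sharing, affinity and capacity tests from
the real condition are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 8	/* hypothetical size of the toy machine */

/* Toy stand-in for the one task field the patch touches. */
struct task_model {
	int recent_used_cpu;
};

/*
 * Patched ordering: snapshot the old value, store prev unconditionally,
 * then decide whether the snapshot is a usable idle candidate.
 * Returns the chosen CPU, or -1 if the recent CPU is not used.
 */
static int pick_recent_used_cpu(struct task_model *p, int prev, int target,
				const bool idle[NR_CPUS_MODEL])
{
	int recent_used_cpu = p->recent_used_cpu;

	assert(prev >= 0 && prev < NR_CPUS_MODEL);

	/* Recorded on every call now, not only when the candidate is taken. */
	p->recent_used_cpu = prev;

	/*
	 * All checks use the local snapshot, because p->recent_used_cpu
	 * already holds prev at this point.
	 */
	if (recent_used_cpu >= 0 && recent_used_cpu < NR_CPUS_MODEL &&
	    recent_used_cpu != prev &&
	    recent_used_cpu != target &&
	    idle[recent_used_cpu])
		return recent_used_cpu;

	return -1;
}
```

In this model a miss still leaves prev recorded for the next wakeup, which is
the behavioural difference the first hunk introduces.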