Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

From: Chen Yu
Date: Fri Jun 16 2023 - 02:17:41 EST


Hi Gautham,
On 2023-06-15 at 11:31:07 +0530, Gautham R. Shenoy wrote:
> Hello Chen Yu,
>
>
> On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> > When CPU is about to enter idle, it invokes newidle_balance() to pull
> > some tasks from other runqueues. Although there is per domain
> > max_newidle_lb_cost to throttle the newidle_balance(), it would be
> > good to further limit the scan based on overall system utilization.
> > The reason is that there is no limitation for newidle_balance() to
> > launch this balance simultaneously on multiple CPUs. Since each
> > newidle_balance() has to traverse all the CPUs to calculate the
> > statistics one by one, this total time cost on newidle_balance()
> > could be O(n^2). This is not good for performance or power saving.
> >
> > For example, sqlite has spent quite some time on newidle balance()
> > on Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
> > 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> > 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
> >
> > Based on this observation, limit the scan depth of newidle_balance()
> > by considering the utilization of the LLC domain. Let the number of
> > scanned groups be a linear function of the utilization ratio:
> >
>
> Is there any particular reason why this is being limited only to the
> LLC domain ?
>
Thank you for your interest in this change. The original thought as you
might know is that our LLC has a huge number of groups.
And since SIS_UTIL has done similar thing for LLC, we want to propose
a simple prototype to demonstrate the idea.
> On architectures where the LLC domain may not be so large (POWER9/10,
> AMD), the additional cost is usually paid at the higher domains where
> the number of groups is greater / equal to the number of groups in the
> LLC domain and where sd_span is pretty large. It would be good to
> explore avoiding the scan cost on those domains as well, right?
>
Exactly.
> > nr_groups_to_scan = nr_groups * (1 - util_ratio)
>
> If we can extend this logic to higher domains, on a Zen3 1 Socket
> server with 128 CPUs at the DIE domain containing 8 groups, we can
> expect a significant reduction in the time spent doing
> update_sg_lb_stats() at higher utilizations.
>
> util_ratio nr_groups_to_scan nr_cpus_scanned
> ========================================================
> 0.9 1 16 (-87.5%)
> 0.75 2 32 (-75%)
> 0.5 4 64 (-50%)
> 0.25 6 96 (-25%)
> 0.1 7 112 (-12.5%)
>
>
> On a Zen 4 1 socket server with 192 CPUs at the DIE domain containing
> 12 groups, values will be:
>
> util_ratio nr_groups_to_scan nr_cpus_scanned
> ========================================================
> 0.9 1 16 (-91%)
> 0.75 3 48 (-75%)
> 0.5 6 96 (-50%)
> 0.25 9 144 (-25%)
> 0.1 10 160 (-16.7%)
>
Thanks for this information, I think we can extend the scan limitation
for not only LLC(MC domain) but also higher domains. I'll think about
it.

thanks,
Chenyu