Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

From: Ingo Molnar
Date: Tue Mar 12 2024 - 06:58:04 EST

Next message: kernel test robot: "WARNING: modpost: "__ashrdi3" [fs/erofs/erofs.ko] has no CRC!"
Previous message: Vamsi Attunuru: "[PATCH v4 1/1] misc: mrvl-cn10k-dpi: add Octeon CN10K DPI administrative driver"
In reply to: Shrikanth Hegde: "Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

* Shrikanth Hegde <sshegde@xxxxxxxxxxxxx> wrote:

> > I think we should probably do something about this contention on this
> > large system: especially if #2 'no work to be done' bailout is the
> > common case.
>
>
> I have been thinking would it be right to move this balancing
> trylock/atomic after should_we_balance(swb). This does reduce the number
> of times this checked/updated significantly. Contention is still present.
> That's possible at higher utilization when there are multiple NUMA
> domains. one CPU in each NUMA domain can contend if their invocation is
> aligned.

Note that it's not true contention: it simply means there's overlapping
requests for the highest domains to be balanced, for which we only have a
single thread of execution at a time, system-wide.

> That makes sense since, Right now a CPU takes lock, checks if it can
> balance, do balance if yes and then releases the lock. If the lock is
> taken after swb then also, CPU checks if it can balance,
> tries to take the lock and releases the lock if it did. If lock is
> contended, it bails out of load_balance. That is the current behaviour as
> well, or I am completely wrong.
>
> Not sure in which scenarios that would hurt. we could do this after this
> series. This may need wider functional testing to make sure we don't
> regress badly in some cases. This is only an *idea* as of now.
>
> Perf probes at spin_trylock and spin_unlock codepoints on the same 224CPU, 6 NUMA node system.
> 6.8-rc6
> -----------------------------------------
> idle system:
> 449 probe:rebalance_domains_L37
> 377 probe:rebalance_domains_L55
> stress-ng --cpu=$(nproc) -l 51 << 51% load
> 88K probe:rebalance_domains_L37
> 77K probe:rebalance_domains_L55
> stress-ng --cpu=$(nproc) -l 100 << 100% load
> 41K probe:rebalance_domains_L37
> 10K probe:rebalance_domains_L55
>
> +below patch
> ----------------------------------------
> idle system:
> 462 probe:load_balance_L35
> 394 probe:load_balance_L274
> stress-ng --cpu=$(nproc) -l 51 << 51% load
> 5K probe:load_balance_L35 <<-- almost 15x less
> 4K probe:load_balance_L274
> stress-ng --cpu=$(nproc) -l 100 << 100% load
> 8K probe:load_balance_L35
> 3K probe:load_balance_L274 <<-- almost 4x less

That's nice.

> +static DEFINE_SPINLOCK(balancing);
> /*
> * Check this_cpu to ensure it is balanced within domain. Attempt to move
> * tasks if there is an imbalance.
> @@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> struct rq *busiest;
> struct rq_flags rf;
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> + int need_serialize;
> struct lb_env env = {
> .sd = sd,
> .dst_cpu = this_cpu,
> @@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> goto out_balanced;
> }
>
> + need_serialize = sd->flags & SD_SERIALIZE;
> + if (need_serialize) {
> + if (!spin_trylock(&balancing))
> + goto lockout;
> + }
> +
> group = find_busiest_group(&env);

So if I'm reading your patch right, the main difference appears to be that
it allows the should_we_balance() check to be executed in parallel, and
will only try to take the NUMA-balancing flag if that function indicates an
imbalance.

Since should_we_balance() isn't taking any locks AFAICS, this might be a
valid approach. What might make sense is to instrument the percentage of
NUMA-balancing flag-taking 'failures' vs. successful attempts - not
necessarily the 'contention percentage'.

But another question is, why do we get here so frequently, so that the
cumulative execution time of these SD_SERIAL rebalance passes exceeds that
of 100% of single CPU time? Ie. a single CPU is basically continuously
scanning the scheduler data structures for imbalances, right? That doesn't
seem natural even with just ~224 CPUs.

Alternatively, is perhaps the execution time of the SD_SERIAL pass so large
that we exceed 100% CPU time?

Thanks,

Ingo

Next message: kernel test robot: "WARNING: modpost: "__ashrdi3" [fs/erofs/erofs.ko] has no CRC!"
Previous message: Vamsi Attunuru: "[PATCH v4 1/1] misc: mrvl-cn10k-dpi: add Octeon CN10K DPI administrative driver"
In reply to: Shrikanth Hegde: "Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]