Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

From: Shrikanth Hegde
Date: Fri Mar 08 2024 - 09:51:39 EST




On 3/8/24 4:53 PM, Ingo Molnar wrote:
>
> * Shrikanth Hegde <sshegde@xxxxxxxxxxxxx> wrote:
>
>> system is at 75% load <-- 25.6% contention
>> 113K probe:rebalance_domains_L37
>> 84K probe:rebalance_domains_L55
>>
>> system is at 100% load <-- 87.5% contention.
>> 64K probe:rebalance_domains_L37
>> 8K probe:rebalance_domains_L55
>>
>>
>> A few reasons for the contention could be:
>>
>> 1. Idle load balance is running and some other CPU is becoming idle and
>> tries newidle_balance.
>>
>> 2. When the system is busy, every CPU does busy balancing and contends
>> for the lock. It will not do the balance, as should_we_balance says this
>> CPU need not balance; it bails out and releases the lock.
>
> Thanks, these measurements are really useful!
>
> Would it be possible to disambiguate these two cases?

I think it's not case 1, since newidle_balance doesn't even take that lock. So
it is likely case 2.
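
To put it in terms of the call paths (a rough sketch of 6.8-rc6, not exact code):

        newidle_balance()                   -> load_balance()
            (never touches the 'balancing' lock, so case 1 is out)

        rebalance_domains(), SD_SERIALIZE domain:
            spin_trylock(&balancing)        <-- probe:rebalance_domains_L37
            -> load_balance()                   (may still bail via should_we_balance)
            spin_unlock(&balancing)         <-- probe:rebalance_domains_L55

A CPU that gets as far as its SD_SERIALIZE (NUMA) domains hits that trylock before
should_we_balance() is ever consulted, which is case 2.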

>
> I think we should probably do something about this contention on this large
> system: especially if #2 'no work to be done' bailout is the common case.
>


I have been thinking about whether it would be right to move this balancing trylock/atomic
to after should_we_balance (swb). That reduces the number of times the lock is checked/updated
significantly. Contention is still present, though; it is possible at higher utilization
when there are multiple NUMA domains, since one CPU in each NUMA domain can contend for the
lock if their invocations are aligned.

That makes sense since, right now, a CPU takes the lock, checks if it can balance, does the
balance if yes, and then releases the lock. If the lock is taken after swb instead, the CPU
still checks if it can balance, then tries to take the lock and releases it once done. If
the lock is contended, it bails out of load_balance. That is the current behaviour as well,
unless I am completely wrong.
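
To spell that out, the two orderings look roughly like this (only a sketch in pseudo-C,
not the exact code; the real diff is further below):

        /* today: rebalance_domains() serializes the whole attempt */
        if (!spin_trylock(&balancing))
                goto out;                 /* contended: skip this domain      */
        load_balance(...);                /* swb check happens inside         */
        spin_unlock(&balancing);

        /* idea: load_balance() serializes only after swb says yes */
        if (!should_we_balance(&env))
                goto out_balanced;        /* most CPUs leave without the lock */
        if (need_serialize && !spin_trylock(&balancing))
                goto lockout;             /* contended: bail out, as today    */
        /* find_busiest_group() + move tasks */
        spin_unlock(&balancing);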

I am not sure in which scenarios that would hurt; we could do this after this series.
It would need wider functional testing to make sure we don't regress badly in some cases.
This is only an *idea* as of now.

Perf probes at the spin_trylock and spin_unlock code points on the same 224-CPU, 6-NUMA-node system.
6.8-rc6
-----------------------------------------
idle system:
449 probe:rebalance_domains_L37
377 probe:rebalance_domains_L55
stress-ng --cpu=$(nproc) -l 51 << 51% load
88K probe:rebalance_domains_L37
77K probe:rebalance_domains_L55
stress-ng --cpu=$(nproc) -l 100 << 100% load
41K probe:rebalance_domains_L37
10K probe:rebalance_domains_L55

6.8-rc6 + below patch
----------------------------------------
idle system:
462 probe:load_balance_L35
394 probe:load_balance_L274
stress-ng --cpu=$(nproc) -l 51 << 51% load
5K probe:load_balance_L35 <<-- almost 15x less
4K probe:load_balance_L274
stress-ng --cpu=$(nproc) -l 100 << 100% load
8K probe:load_balance_L35
3K probe:load_balance_L274 <<-- almost 4x less
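
(On reading these numbers: the first count in each pair is the spin_trylock probe and the
second is the spin_unlock probe, so their difference is roughly the number of failed, i.e.
contended, trylock attempts. That is how the contention percentages quoted above were
derived, e.g. (64K - 8K) / 64K ~= 87.5% at 100% load on 6.8-rc6.)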


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 62f247bdec86..3a8de7454377 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11272,6 +11272,7 @@ static int should_we_balance(struct lb_env *env)
         return group_balance_cpu(sg) == env->dst_cpu;
 }
 
+static DEFINE_SPINLOCK(balancing);
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
         struct rq *busiest;
         struct rq_flags rf;
         struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+        int need_serialize = 0;
         struct lb_env env = {
                 .sd             = sd,
                 .dst_cpu        = this_cpu,
@@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
                 goto out_balanced;
         }
 
+        need_serialize = sd->flags & SD_SERIALIZE;
+        if (need_serialize) {
+                if (!spin_trylock(&balancing))
+                        goto lockout;
+        }
+
         group = find_busiest_group(&env);
         if (!group) {
                 schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11434,6 +11442,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
                         if (!cpumask_subset(cpus, env.dst_grpmask)) {
                                 env.loop = 0;
                                 env.loop_break = SCHED_NR_MIGRATE_BREAK;
+                                if (need_serialize)
+                                        spin_unlock(&balancing);
                                 goto redo;
                         }
                         goto out_all_pinned;
@@ -11540,7 +11550,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
              sd->balance_interval < MAX_PINNED_INTERVAL) ||
             sd->balance_interval < sd->max_interval)
                 sd->balance_interval *= 2;
+
 out:
+        if (need_serialize)
+                spin_unlock(&balancing);
+
+lockout:
         return ld_moved;
 }
 
@@ -11665,7 +11680,6 @@ static int active_load_balance_cpu_stop(void *data)
         return 0;
 }
 
-static DEFINE_SPINLOCK(balancing);
 
 /*
  * Scale the max load_balance interval with the number of CPUs in the system.
@@ -11716,7 +11730,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
         /* Earliest time when we have to do rebalance again */
         unsigned long next_balance = jiffies + 60*HZ;
         int update_next_balance = 0;
-        int need_serialize, need_decay = 0;
+        int need_decay = 0;
         u64 max_cost = 0;
 
         rcu_read_lock();
@@ -11741,12 +11755,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 
                 interval = get_sd_balance_interval(sd, busy);
 
-                need_serialize = sd->flags & SD_SERIALIZE;
-                if (need_serialize) {
-                        if (!spin_trylock(&balancing))
-                                goto out;
-                }
-
                 if (time_after_eq(jiffies, sd->last_balance + interval)) {
                         if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
                                 /*
@@ -11760,9 +11768,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
                         sd->last_balance = jiffies;
                         interval = get_sd_balance_interval(sd, busy);
                 }
-                if (need_serialize)
-                        spin_unlock(&balancing);
-out:
+
                 if (time_after(next_balance, sd->last_balance + interval)) {
                         next_balance = sd->last_balance + interval;
                         update_next_balance = 1;