Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

From: Shrikanth Hegde
Date: Tue Mar 05 2024 - 06:12:32 EST




On 3/4/24 3:18 PM, Ingo Molnar wrote:
> The 'balancing' spinlock added in:
>
> 08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")
>

[...]

>
>
> -static DEFINE_SPINLOCK(balancing);
> +/*
> + * This flag serializes load-balancing passes over large domains
> + * (such as SD_NUMA) - only one load-balancing instance may run
> + * at a time, to reduce overhead on very large systems with lots
> + * of CPUs and large NUMA distances.
> + *
> + * - Note that load-balancing passes triggered while another one
> + * is executing are skipped and not re-tried.
> + *
> + * - Also note that this does not serialize sched_balance_domains()
> + * execution, as non-SD_SERIALIZE domains will still be
> + * load-balanced in parallel.
> + */
> +static atomic_t sched_balance_running = ATOMIC_INIT(0);
>
> /*
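
(For context: the series replaces the spin_trylock()/spin_unlock() pair
in rebalance_domains() with this flag. The actual hunk is trimmed above;
the sketch below shows the general pattern, assuming
atomic_cmpxchg_acquire()/atomic_set_release() - it is an illustration,
not the exact patch text:)

	need_serialize = sd->flags & SD_SERIALIZE;
	if (need_serialize) {
		/* Claim the flag; a non-zero old value means another
		 * CPU is already running a serialized pass - skip. */
		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
			goto out;
	}

	/* ... load_balance() pass over this domain ... */

	if (need_serialize)
		/* Let the next serialized balancing pass run. */
		atomic_set_release(&sched_balance_running, 0);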

Continuing the discussion on whether this balancing lock is contended
or not.


It was observed on a large system (1920 CPUs, 16 NUMA nodes) that the cacheline containing the
balancing trylock was contended, and rebalance_domains showed up in the traces.

So I ran some experiments on a smaller system. This system has 224 CPUs and 6 NUMA nodes.
I added probe points in rebalance_domains: the probe at L37 counts every trylock attempt, while
the probe at L55 counts only successful acquisitions (the unlock). If the lock is never contended,
every trylock succeeds and the two counts should match; any gap between them indicates contention.
Below are the system details and perf probe -L rebalance_domains.

NUMA:
NUMA node(s): 6
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-71
NUMA node4 CPU(s): 72-111
NUMA node5 CPU(s): 112-151
NUMA node6 CPU(s): 152-183
NUMA node7 CPU(s): 184-223


------------------------------------------------------------------------------------------------------------------
#perf probe -L rebalance_domains
<rebalance_domains@/shrikanth/sched_tip/kernel/sched/fair.c:0>
  0  static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
     {
  2  	int continue_balancing = 1;
  3  	int cpu = rq->cpu;

[...]

 33  		interval = get_sd_balance_interval(sd, busy);

     		need_serialize = sd->flags & SD_SERIALIZE;
 36  		if (need_serialize) {
 37  			if (!spin_trylock(&balancing))
     				goto out;
     		}

 41  		if (time_after_eq(jiffies, sd->last_balance + interval)) {
 42  			if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
     				/*
     				 * The LBF_DST_PINNED logic could have changed
     				 * env->dst_cpu, so we can't know our idle
     				 * state even if we migrated tasks. Update it.
     				 */
 48  				idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
 49  				busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
     			}
 51  			sd->last_balance = jiffies;
 52  			interval = get_sd_balance_interval(sd, busy);
     		}
 54  		if (need_serialize)
 55  			spin_unlock(&balancing);
     out:
 57  		if (time_after(next_balance, sd->last_balance + interval)) {
     			next_balance = sd->last_balance + interval;
     			update_next_balance = 1;
     		}
     	}

perf probe --list
probe:rebalance_domains_L37 (on rebalance_domains+856)
probe:rebalance_domains_L55 (on rebalance_domains+904)
------------------------------------------------------------------------------------------------------------------

Perf records were collected for 10 seconds under different system loads; the load was created
using stress-ng. Contention is calculated as (1 - L55/L37) * 100.
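For example, at 100% load the counts below give
(1 - 8K/64K) * 100 = (1 - 0.125) * 100 = 87.5% contention.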

system is idle: <-- No contention
1K probe:rebalance_domains_L37
1K probe:rebalance_domains_L55


system is at 25% load: <-- 4.4% contention
223K probe:rebalance_domains_L37: 1 chunks LOST!
213K probe:rebalance_domains_L55: 1 chunks LOST!



system is at 50% load <-- 12.5% contention
168K probe:rebalance_domains_L37
147K probe:rebalance_domains_L55


system is at 75% load <-- 25.6% contention
113K probe:rebalance_domains_L37
84K probe:rebalance_domains_L55

system is at 100% load <-- 87.5% contention.
64K probe:rebalance_domains_L37
8K probe:rebalance_domains_L55


A few reasons for the contention could be:
1. Idle load balancing is running while some other CPU becomes idle and tries newidle balance.
2. When the system is busy, every CPU does busy balancing and contends for the lock. It then does
no actual balancing, since should_we_balance says this CPU need not balance; it bails out and
releases the lock (see the sketch below).