Re: [RFC PATCH 1/1] sched/fair: Fix SMT balance dependency on CPU numbering

From: Vincent Guittot
Date: Mon Jun 05 2023 - 05:30:58 EST


Hi Michael,

On Wed, 31 May 2023 at 19:55, Michael Kelley <mikelley@xxxxxxxxxxxxx> wrote:
>
> With some CPU numbering schemes, the function select_idle_cpu() currently
> has a subtle bias to return the first hyper-thread in a core. As a result
> work is not evenly balanced across hyper-threads in a core. The difference
> is often as much as 15 to 20 percentage points -- i.e., the first
> hyper-thread might run at 45% load while the second hyper-thread runs at
> only 30% load or less.
>
> Two likely CPU numbering schemes make sense with today's typical case
> of two hyper-threads per core:
>
> A. Enumerate all the first hyper-theads in a core, then all the second
> hyper-threads in a core. If a system has 8 cores with 16 hyper-threads,
> CPUs #0 and #8 are in the same core, as are CPUs #1 and #9, etc.
>
> B. Enumerate all hyper-threads in a core, then all hyper-threads in the
> next core, etc. Again with 8 cores and 16 hyper-threads, CPUs #0 and
> #1 are in the same core, as are CPUs #2 and #3, etc.
>
> Scheme A is used in most ACPI bare metal systems and in VMs running on
> KVM. The enumeration order is determined by the order of the processor
> entries in the ACPI MADT, and the ACPI spec *recommends* that the MADT
> be set up for scheme A.
>
> However, for reasons that pre-date the ACPI recommendation, Hyper-V
> guests have an ACPI MADT that is set up for scheme B. When using scheme B,
> the subtle bias is evident in select_idle_cpu(). While having Hyper-V
> conform to the ACPI spec recommendation would solve the Hyper-V problem,
> it is also desirable for the fair scheduler code to be independent of the
> CPU numbering scheme. ACPI is not always the firmware configuration
> mechanism, and CPU numbering schemes might vary more in architectures
> other than x86/x64.
>
> The bias occurs with scheme B when "has_idle_cpu" is true and

I assume that you mean has_idle_core as I can't find has_idle_cpu in the code

> select_idle_core() is called in the for_each_cpu_wrap() loop. Regardless
> of where the loop starts, it will almost immediately encountered a CPU
> that is the first hyper-thread in a core. If that core is idle, the CPU
> number of that first hyper-thread is returned. If that core is not idle,
> both hyper-threads are removed from the cpus mask, and the loop iterates
> to choose another CPU that is the first hyper-thread in a core. As a
> result, select_idle_core() almost always returns the first hyper-thread
> in a core.
>
> The bias does not occur with scheme A because half of the CPU numbering
> space is a series of CPUs that are the second hyper-thread in all the
> cores. Assuming that the "target" CPU is evenly distributed throughout
> the CPU numbering space, there's a 50/50 chance of starting in the portion
> of the CPU numbering space that is all second hyper-threads. If
> select_idle_core() finds a idle core, it will likely return a CPU that
> is the second hyper-thread in the core. On average over the time,
> both the first and second hyper-thread are equally likely to be
> returned.
>
> Fix this bias by determining which hyper-thread in a core the "target"
> CPU is -- i.e., the "smt index" of that CPU. Then when select_idle_core()
> finds an idle core, it returns the CPU in the core that has the same
> smt index. If that CPU is not valid to be chosen, just return the CPU
> that was passed into select_idle_core() and don't worry about bias.
>
> With scheme B, this fix solves the bias problem by making the chosen
> CPU be roughly equally likely to either hyper-thread. With scheme A
> there's no real effect as the chosen CPU was already equally likely
> to be either hyper-thread, and still is.
>
> The imbalance in hyper-thread loading was originally observed in a
> customer workload, and then reproduced with a synthetic workload.
> The change has been tested with the synthetic workload in a Hyper-V VM
> running the normal scheme B CPU numbering, and then with the MADT
> replaced with a scheme A version using Linux's ability to override
> ACPI tables. The testing showed no change hyper-thread loading
> balance with the scheme A CPU numbering, but the imbalance is
> corrected if the CPU numbering is scheme B.

You failed to explain why it's important to evenly select 1st or 2nd
hyper-threads on the system. I don't see any performance figures.
What would be the benefit ?

>
> Signed-off-by: Michael Kelley <mikelley@xxxxxxxxxxxxx>
> ---
>
> I haven't previously worked in Linux scheduler code, so I'm posting this
> as an RFC to point out the observed problem, and to suggest a solution.
> There may well be considerations in the design of a solution that I'm not
> aware of, so please educate me or suggest an alternative.
>
> It's also not completely clear whether an imbalance in hyper-thread
> loading is actually a problem. It looks weird, and causes customer
> concern when it is observed consistently across all cores in some
> production workload. The fair scheduler strives to balance load evenly, so
> I'm treating it as a problem that should be fixed, if for no other reason
> than general goodness. But again, I'm sure reviewers will feel free to
> tell me otherwise. :-) The fix takes relatively few CPU cycles, but it's
> still a non-zero cost.
>
> FWIW, the same imbalance has been observed with kernels as far back as
> 5.4, and the root cause in the code is essentially the same. So it's not
> a recently introduced issue. I haven't tried anything earlier than 5.4.
>
> kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++------
> 1 file changed, 30 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f..8b56e9d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6832,6 +6832,19 @@ static inline bool test_idle_cores(int cpu)
> return false;
> }
>
> +static inline int get_smt_index(int core)
> +{
> + int cpu, n = 0;
> +
> + for_each_cpu(cpu, cpu_smt_mask(core)) {
> + if (cpu == core)
> + return n;
> + n++;
> + }
> + /* If get here, cpu_smt_mask is set up incorrectly */
> + return 0;
> +}
> +
> /*
> * Scans the local SMT mask to see if the entire core is idle, and records this
> * information in sd_llc_shared->has_idle_cores.
> @@ -6866,10 +6879,11 @@ void __update_idle_core(struct rq *rq)
> * there are no idle cores left in the system; tracked through
> * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
> */
> -static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> +static int select_idle_core(struct task_struct *p, int core, int smt_index,
> + struct cpumask *cpus, int *idle_cpu)
> {
> bool idle = true;
> - int cpu;
> + int cpu, index_cpu, n = 0;
>
> for_each_cpu(cpu, cpu_smt_mask(core)) {
> if (!available_idle_cpu(cpu)) {
> @@ -6885,10 +6899,13 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> }
> if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
> *idle_cpu = cpu;
> +
> + if (n++ == smt_index)
> + index_cpu = cpu;
> }
>
> if (idle)
> - return core;
> + return cpumask_test_cpu(index_cpu, p->cpus_ptr) ? index_cpu : core;
>
> cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
> return -1;
> @@ -6922,7 +6939,13 @@ static inline bool test_idle_cores(int cpu)
> return false;
> }
>
> -static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> +static inline int get_smt_index(int core)
> +{
> + return 0;
> +}
> +
> +static inline int select_idle_core(struct task_struct *p, int core, int smt_index,
> + struct cpumask *cpus, int *idle_cpu)
> {
> return __select_idle_cpu(core, p);
> }
> @@ -6942,7 +6965,7 @@ static inline int select_idle_smt(struct task_struct *p, int target)
> static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
> {
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> - int i, cpu, idle_cpu = -1, nr = INT_MAX;
> + int i, cpu, smt_index, idle_cpu = -1, nr = INT_MAX;
> struct sched_domain_shared *sd_share;
> struct rq *this_rq = this_rq();
> int this = smp_processor_id();
> @@ -6994,9 +7017,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> }
> }
>
> + smt_index = get_smt_index(target);
> for_each_cpu_wrap(cpu, cpus, target + 1) {
> if (has_idle_core) {
> - i = select_idle_core(p, cpu, cpus, &idle_cpu);
> + i = select_idle_core(p, cpu, smt_index, cpus, &idle_cpu);
> if ((unsigned int)i < nr_cpumask_bits)
> return i;
>
> --
> 1.8.3.1
>