Re: [RFC PATCH 0/2] Adjust CFS loadbalance to adapt QEMU CPU topology.

From: Kenan.Liu
Date: Thu Jul 20 2023 - 22:59:01 EST


Hi Peter, thanks for your attention.

Please see my answers to your questions inline:


On 2023/7/20 at 4:50 PM, Peter Zijlstra wrote:
On Thu, Jul 20, 2023 at 04:34:11PM +0800, Kenan.Liu wrote:
From: "Kenan.Liu" <Kenan.Liu@xxxxxxxxxxxxxxxxx>

Multithreading workloads in VM with Qemu may encounter an unexpected
phenomenon: one hyperthread of a physical core is busy while its sibling
is idle. Such as:
Is this with vCPU pinning? Without that, guest topology makes no sense
whatsoever.


The vCPUs are pinned on the host, and the imbalance we observed is inside
the VM, not among the vCPU threads on the host.


The main reason is that the hyperthread index is consecutive in the qemu native x86 CPU
model, which is different from the physical topology.
I'm sorry, what? That doesn't make sense. SMT enumeration is all over
the place for Intel, but some actually do have (n,n+1) SMT. On AMD it's
always (n,n+1) IIRC.

Under the current kernel scheduler
implementation, a hyperthread with an even ID number will be picked with a much
higher probability during load balancing and load deployment.
How so?


The SMT topology in the qemu native x86 CPU model is (0,1),…,(n,n+1),…,
but the SMT topology normally seen on a physical machine is (0,n),(1,n+1),…,
where n is the total number of cores in the machine.
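
To make the two layouts concrete, here is a minimal user-space sketch
(not kernel code; NR_CORES is a made-up value just for illustration)
that prints the sibling pairs produced by each enumeration:

    #include <stdio.h>

    #define NR_CORES 4    /* hypothetical core count for the example */

    int main(void)
    {
        for (int core = 0; core < NR_CORES; core++) {
            /* qemu native x86 model: siblings get consecutive IDs */
            printf("qemu:     core %d -> HTs (%d,%d)\n",
                   core, 2 * core, 2 * core + 1);
            /* typical physical layout: siblings are NR_CORES apart */
            printf("physical: core %d -> HTs (%d,%d)\n",
                   core, core, core + NR_CORES);
        }
        return 0;
    }

With NR_CORES = 4 this prints (0,1)…(6,7) for the qemu layout and
(0,4)…(3,7) for the physical layout.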

The imbalance happens when the number of runnable threads is less
than the number of hyperthreads: select_idle_core() is called
to decide which CPU the woken-up task should be placed on.

select_idle_core() returns the checked CPU number if the whole
core is idle. Conversely, if any HT of the core is busy,
select_idle_core() clears the whole core out of the cpumask and
checks the next core.

select_idle_core():
    …
    if (idle)
        return core;

    cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
    return -1;

In this manner, except at the very beginning of the for_each_cpu_wrap() loop,
the HT with the even ID number is always checked first, and select_idle_core()
returns that CPU to the caller if the whole core is idle, so the odd-numbered HT
has almost no chance of being selected.

select_idle_cpu():
    …
    for_each_cpu_wrap(cpu, cpus, target + 1) {
        if (has_idle_core) {
            i = select_idle_core(p, cpu, cpus, &idle_cpu);

And this will NOT happen when the SMT topology is (0,n),(1,n+1),…: when
the loop starts in the bottom half of the CPU IDs, the cores before the
starting point are reached first through their larger-numbered HTs, and when
it starts in the top half, the cores reached after the wrap are entered first
through their smaller-numbered siblings, so neither sibling is systematically
favored by the inner-core search.
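
To illustrate the wrap-order argument, below is a small user-space
simulation (not kernel code; core_of() and count_first_visited() are
helpers written only for this example). Assuming every core is idle,
it walks the CPUs in for_each_cpu_wrap() order from every possible
starting point and counts, for each core, which sibling is reached
first under each layout:

    #include <stdio.h>

    #define NR_CORES 4
    #define NR_CPUS  (2 * NR_CORES)

    /* Which core does a CPU belong to under each layout? */
    static int core_of(int cpu, int qemu_layout)
    {
        return qemu_layout ? cpu / 2 : cpu % NR_CORES;
    }

    static void count_first_visited(int qemu_layout, const char *name)
    {
        int first_is_low = 0, first_is_high = 0;

        /* Try every possible starting CPU of for_each_cpu_wrap(). */
        for (int start = 0; start < NR_CPUS; start++) {
            int seen[NR_CORES] = { 0 };

            for (int i = 0; i < NR_CPUS; i++) {
                int cpu = (start + i) % NR_CPUS;
                int core = core_of(cpu, qemu_layout);

                if (seen[core])
                    continue;   /* core already entered via its sibling */
                seen[core] = 1;

                /* Which sibling reached this core first? */
                if (qemu_layout ? (cpu % 2 == 0) : (cpu < NR_CORES))
                    first_is_low++;
                else
                    first_is_high++;
            }
        }
        printf("%s: low-ID sibling first %d times, high-ID sibling first %d times\n",
               name, first_is_low, first_is_high);
    }

    int main(void)
    {
        count_first_visited(1, "qemu (n,n+1) layout");
        count_first_visited(0, "physical (0,n) layout");
        return 0;
    }

With NR_CORES = 4, the qemu layout reaches the even (low-ID) sibling first
28 out of 32 times, while the physical layout splits 16/16, matching the
bias described above.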



This RFC aims to solve the problem by adjusting the CFS loadbalance policy:
1. Explore the CPU topology and adjust the CFS loadbalance policy when we find a machine
with the qemu native CPU topology.
2. Export a procfs interface to control the traverse length when selecting an idle CPU.

Kenan.Liu (2):
sched/fair: Adjust CFS loadbalance for machine with qemu native CPU
topology.
sched/fair: Export a param to control the traverse len when select
idle cpu.
NAK, qemu can either provide a fake topology to the guest using normal
x86 means (MADT/CPUID) or do some paravirt topology setup, but this is
quite insane.
Thanks,

Kenan.Liu