Re: [RFC PATCH 0/2] Adjust CFS loadbalance to adapt QEMU CPU topology.

From: Vincent Guittot
Date: Fri Jul 21 2023 - 04:34:05 EST


On Fri, 21 Jul 2023 at 04:59, Kenan.Liu <Kenan.Liu@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi Peter, thanks for your attention.
>
> Please see my answers to your questions inline:
>
>
> On 2023/7/20 at 4:50 PM, Peter Zijlstra wrote:
> > On Thu, Jul 20, 2023 at 04:34:11PM +0800, Kenan.Liu wrote:
> >> From: "Kenan.Liu" <Kenan.Liu@xxxxxxxxxxxxxxxxx>
> >>
> >> Multithreaded workloads in a VM run with QEMU may encounter an unexpected
> >> phenomenon: one hyperthread of a physical core is busy while its sibling
> >> is idle. Such as:
> > Is this with vCPU pinning? Without that, guest topology makes no sense
> > whatsoever.
>
>
> vCPUs are pinned on the host, and the imbalance we observed is inside the
> VM, not among the vCPU threads on the host.
>
>
> >> The main reason is that sibling hyperthreads have consecutive indexes in the
> >> qemu native x86 CPU model, which is different from the physical topology.
> > I'm sorry, what? That doesn't make sense. SMT enumeration is all over
> > the place for Intel, but some actually do have (n,n+1) SMT. On AMD it's
> > always (n,n+1) IIRC.
> >
> >> In the current kernel scheduler
> >> implementation, a hyperthread with an even ID number will be picked with a
> >> much higher probability during load-balancing and load-deploying.
> > How so?
>
>
> The SMT topology in the qemu native x86 CPU model is (0,1), ..., (n,n+1), ...,
> but the SMT topology normally seen on a physical machine is like (0,n), (1,n+1), ...,
> where n is the total number of cores in the machine.
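>
> For illustration only (a userspace sketch, not kernel code; NCORES and the
> two helper names below are made up for this example), the two cpu-id
> mappings can be written as:
>
> #include <stdio.h>
>
> #define NCORES 4   /* assumed core count, just for the example */
>
> /* qemu native x86 model: siblings are consecutive -> (0,1), (2,3), ... */
> static int cpu_id_qemu(int core, int thread) { return core * 2 + thread; }
>
> /* typical physical enumeration: siblings are (i, i+NCORES) -> (0,4), (1,5), ... */
> static int cpu_id_host(int core, int thread) { return core + thread * NCORES; }
>
> int main(void)
> {
>         for (int core = 0; core < NCORES; core++)
>                 printf("core %d: qemu siblings (%d,%d), host siblings (%d,%d)\n",
>                        core, cpu_id_qemu(core, 0), cpu_id_qemu(core, 1),
>                        cpu_id_host(core, 0), cpu_id_host(core, 1));
>         return 0;
> }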
>
> The imbalance happens when the number of runnable threads is less
> than the number of hyperthreads: select_idle_core() is called
> to decide which cpu should run the woken-up task.
>
> select_idle_core() returns the checked cpu number if the whole
> core is idle. Conversely, if either HT of the core is busy,
> select_idle_core() clears the whole core out of the cpumask and
> checks the next core.
>
> select_idle_core():
>
>         if (idle)
>                 return core;
>
>         cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
>         return -1;
>
> In this manner, except at the very beginning of the for_each_cpu_wrap()
> loop, the HT with the even ID number is always checked first and returned
> to the caller if the whole core is idle, so the odd-indexed HT almost
> has no chance to be selected.
>
> select_idle_cpu():
>
>         for_each_cpu_wrap(cpu, cpus, target + 1) {
>                 if (has_idle_core) {
>                         i = select_idle_core(p, cpu, cpus, &idle_cpu);
>
> And this does NOT happen when the SMT topology is (0,n), (1,n+1), ...:
> depending on where the loop starts, some cores have their higher-numbered
> HT checked first while others have their lower-numbered sibling take the
> first place in the inner-core search, so the picks are spread over both siblings.
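>
> To make the effect concrete, here is a small userspace sketch (again purely
> illustrative: the cpu counts, busy pattern and helper names are assumptions,
> not the kernel implementation) that mimics the scan above with one busy HT
> and prints which HT would be picked for every possible starting target:
>
> #include <stdio.h>
> #include <stdbool.h>
>
> #define NCORES 4
> #define NCPUS  (NCORES * 2)
>
> static bool qemu_layout;        /* true: siblings (n,n+1); false: (i,i+NCORES) */
> static bool busy[NCPUS];
>
> static int sibling(int cpu)
> {
>         return qemu_layout ? (cpu ^ 1) : (cpu + NCORES) % NCPUS;
> }
>
> /* Walk the cpus like for_each_cpu_wrap(cpu, cpus, target + 1) and return the
>  * first HT whose whole core is idle; a core with one busy HT is dropped from
>  * the scan, like the cpumask_andnot() in select_idle_core() above. */
> static int pick(int target)
> {
>         bool dropped[NCPUS] = { false };
>
>         for (int k = 0; k < NCPUS; k++) {
>                 int cpu = (target + 1 + k) % NCPUS;
>
>                 if (dropped[cpu])
>                         continue;
>                 if (!busy[cpu] && !busy[sibling(cpu)])
>                         return cpu;     /* whole core idle: take the HT reached first */
>                 dropped[cpu] = dropped[sibling(cpu)] = true;
>         }
>         return -1;
> }
>
> int main(void)
> {
>         busy[0] = true;                 /* one HT of one core is already busy */
>
>         for (int layout = 0; layout < 2; layout++) {
>                 qemu_layout = layout;
>                 printf("%s:", qemu_layout ? "(n,n+1)" : "(i,i+NCORES)");
>                 for (int target = 0; target < NCPUS; target++)
>                         printf(" %d", pick(target));
>                 printf("\n");
>         }
>         return 0;
> }
>
> With the (n,n+1) layout, every time the scan has to step over the busy core
> it lands on the even sibling of the next idle core, and the bias grows as
> more cores get one busy HT; with the (0,n)-style layout the picks spread
> over both the lower- and higher-numbered siblings.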

But why is this a problem? Your system is almost idle and only 1 HT per core
is used. Why does it matter which HT is selected, as long as we give
priority to selecting an idle core?

This seems related to
https://lore.kernel.org/lkml/BYAPR21MB1688FE804787663C425C2202D753A@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
where we concluded that it was not a problem.

>
>
> >
> >> This RFC aims to solve the problem by adjusting the CFS loadbalance policy:
> >> 1. Explore the CPU topology and adjust the CFS loadbalance policy when a
> >> machine with the qemu native CPU topology is found.
> >> 2. Export a procfs interface to control the traverse length when selecting an
> >> idle cpu.
> >>
> >> Kenan.Liu (2):
> >> sched/fair: Adjust CFS loadbalance for machine with qemu native CPU
> >> topology.
> >> sched/fair: Export a param to control the traverse len when select
> >> idle cpu.
> > NAK, qemu can either provide a fake topology to the guest using normal
> > x86 means (MADT/CPUID) or do some paravirt topology setup, but this is
> > quite insane.
> Thanks,
>
> Kenan.Liu