Re: [RFC PATCH 0/2] Adjust CFS loadbalance to adapt QEMU CPU topology.

From: luoben
Date: Mon Jul 24 2023 - 02:58:24 EST



On 2023/7/21 17:13, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
On Fri, Jul 21, 2023 at 10:33:44AM +0200, Vincent Guittot wrote:
> On Fri, 21 Jul 2023 at 04:59, Kenan.Liu <Kenan.Liu@xxxxxxxxxxxxxxxxx> wrote:

>> The SMT topology in the qemu native x86 CPU model is (0,1),…,(n,n+1),…,
>> but the SMT topology normally seen on a physical machine is like
>> (0,n),(1,n+1),…, where n is the total number of cores in the machine.
>>
>> The imbalance happens when the number of runnable threads is less
>> than the number of hyperthreads, because select_idle_core() is then
>> called to decide which CPU should run the woken-up task.
>>
>> select_idle_core() returns the checked CPU number if the whole
>> core is idle. Otherwise, if any HT of the core is busy,
>> select_idle_core() clears the whole core from the cpumask and
>> the next core is checked.
>>
>> select_idle_core():
>> …
>> 	if (idle)
>> 		return core;
>>
>> 	cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
>> 	return -1;
>>
>> In this manner, except at the very beginning of the for_each_cpu_wrap()
>> loop, the HT with the even ID number is always checked first and is
>> returned to the caller if the whole core is idle, so the odd-indexed HT
>> has almost no chance of being selected.
>>
>> select_idle_cpu():
>> …
>> 	for_each_cpu_wrap(cpu, cpus, target + 1) {
>> 		if (has_idle_core) {
>> 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
>>
>> This does NOT happen when the SMT topology is (0,n),(1,n+1),…, because
>> when the loop starts from the bottom half of the CPU numbers, the HTs
>> with larger numbers are checked first for some cores, and when it starts
>> from the top half, their smaller-numbered siblings come first in the
>> per-core search.
>
> But why is it a problem? Your system is almost idle and 1 HT per core
> is used. Who cares whether one HT or the other is selected evenly, as
> long as we select an idle core in priority?

Right, why is this a problem? Hyperthreads are supposed to be symmetric,
it doesn't matter which of the two is active, the important thing is to
only have one active if we can.

(Unlike Power7, which has asymmetric SMT.)


Hi Peter and Vincent,

Some upper-level monitoring logic may use CPU usage as a metric for
scaling computing resources. Imbalanced scheduling can create the illusion
of CPU scarcity, causing the upper-level scheduling system to trigger
resource expansion more often than necessary, which actually wastes
resources. So we think this may be a problem.

Could you please take a further look at PATCH#2? We found that the default
'nr' value did not perform well in our scenario, and we believe an
adjustable variable would be more appropriate.

Our scenario is as follows:
16 processes run in a 32-CPU VM, with 8 threads per process, all
running the same job.
We expected CPU usage to be evenly distributed, but we found that
even-numbered CPUs were favored by the scheduling decisions and
consumed more CPU resources (5%~20%), mainly because of the default
value nr=4. In this scenario, we found nr=2 to be more suitable.

Thanks,
Ben