Re: [RFC PATCH] sched/fair: Fix impossible migrate_util scenario in load balance

From: Qais Yousef
Date: Mon Jul 24 2023 - 12:10:47 EST


On 07/24/23 14:58, Dietmar Eggemann wrote:
> On 22/07/2023 00:04, Qais Yousef wrote:
> > On 07/21/23 15:52, Vincent Guittot wrote:
> >> On Friday 21 Jul 2023 at 11:57:11 (+0100), Qais Yousef wrote:
> >>> On 07/20/23 14:31, Vincent Guittot wrote:
> >>>
> >>>> I was trying to reproduce the behavior but I was failing until I
> >>>> realized that this code path is used when the 2 groups are not sharing
> >>>> their cache. Which topology do you use? I thought that DynamIQ, with
> >>>> cache shared between all 8 CPUs, was the norm for arm64 embedded
> >>>> devices now.
> >>>
> >>> Hmm good question. Phantom domains didn't die, which I think is what is
> >>> causing this. I can look into whether this is for a good reason or just a
> >>> historical artifact.
> >>>
> >>>>
> >>>> Also when you say "the little cluster capacity is very small nowadays
> >>>> (around 200 or less)", is it the capacity of 1 core or the cluster?
> >>>
> >>> I meant one core. So in my case all the littles were busy except for one that
> >>> was mostly idle and never pulled a task from mid where two tasks were stuck on
> >>> a CPU there. And the logs I have added were showing me that the env->imbalance
> >>> was in the 150+ range but the task we pull was in the 350+ range.
> >>
> >> I'm not able to reproduce your problem with v6.5-rc2 and without phantom domain,
> >> which is expected because we share cache and weight is 1 so we use the path
> >>
> >> if (busiest->group_weight == 1 || sds->prefer_sibling) {
> >> /*
> >> * When prefer sibling, evenly spread running tasks on
> >> * groups.
> >> */
> >> env->migration_type = migrate_task;
> >> env->imbalance = sibling_imbalance(env, sds, busiest, local);
> >> } else {
> >>
> >
> > I missed the deps on topology. So yes you're right, this needs to be addressed
> > first. I seem to remember Sudeep merged some stuff that will flatten these
> > topologies.
> >
> > Let me chase this topology thing out first.
>
> Sudeep's patches align topology cpumasks with cache cpumasks.
>
> tip/sched/core:
>
> root@juno:~# cat /sys/devices/system/cpu/cpu*/topology/package_cpus
> 3f
> 3f
> 3f
> 3f
> 3f
> 3f
>
> v5.9:
>
> root@juno:~# cat /sys/devices/system/cpu/cpu*/topology/package_cpus
> 39
> 06
> 06
> 39
> 39
> 39
>
> So Android userspace won't be able to detect uArch boundaries via
> `package_cpus` any longer.
>
> The phantom domain (DIE) in Android is a legacy decision from within
> Android. The pre-mainline Energy Model was attached to the sched domain
> topology hierarchy. And then IMHO other Android functionality started to
> rely on this. It could be removed regardless of Sudeep's patches in case
> Android would be OK with it.
>
> The phantom domain is probably set up via DT cpu_map entry:
>
> cpu-map {
> cluster0 { <-- enforce phantom domain
> core0 {
> cpu = <&CPU0>;
> };
> ...
> core3 {
> cpu = <&CPU1>;
> };
> cluster1 {
> ...
>
> Juno (arch/arm64/boot/dts/arm/juno.dts) also uses cpu-map to keep DIE
> group boundaries congruent with uArch boundaries.
>
> tip/sched/core:
>
> # cat /proc/schedstat | awk '{ print $1 " " $2}' | head -5
> ...
> cpu0 0
> domain0 39
> domain1 3f
>
> v5.9:
>
> # cat /proc/schedstat | awk '{ print $1 " " $2}' | head -5
> ...
> cpu0 0
> domain0 39
> domain1 3f
>
> We had a talk at LPC '22 about the influence of the patch-set and the
> phantom domain legacy issue:
>
> https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf
>
> [...]

Thanks Dietmar!

So I actually moved everything to a single cluster and this indeed solves the
lb() issue. But then when I looked at the mainline DTs I saw that they still
define a separate cluster for each uArch, and this got me confused about
whether I did the right thing or not. It also made me wonder whether the fix
is to change the DT or to port Sudeep's/Ionela's patches?
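
A toy model of the behaviour discussed in this thread may help anyone
following along. It is not kernel code: the function names are made up for
illustration, and the conditions are simplified to the two points made above,
namely that the utilization-based path is only taken when the two groups do
not share cache, and (an assumption of this sketch) that a task is skipped
when its utilization is larger than env->imbalance. The numbers are the
approximate ones quoted earlier (imbalance around 150, task around 350).

```python
def pick_migration_type(groups_share_cache, busiest_overloaded):
    """Toy version of the choice made when the local group has spare capacity.

    Simplification: the real code checks more state than these two flags.
    """
    if busiest_overloaded and not groups_share_cache:
        # Phantom-domain case: the groups do not share cache.
        return "migrate_util"
    # Shared cache (flattened topology): balance on the number of tasks.
    return "migrate_task"


def can_pull(task_util, imbalance):
    """Assumed rule: a task is only pulled if its utilization fits the imbalance."""
    return task_util <= imbalance


# Phantom domain present: utilization path, and the 350 task never fits 150.
print(pick_migration_type(groups_share_cache=False, busiest_overloaded=True))
print(can_pull(task_util=350, imbalance=150))

# Single cluster: the task-count path is used instead.
print(pick_migration_type(groups_share_cache=True, busiest_overloaded=True))
```

With the phantom domain this prints migrate_util and False, which matches the
stuck task; with a single cluster it prints migrate_task.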

I did some digging and I think the DTs, like the ones in mainline by the look
of it, stayed the way they were historically defined.

So IIUC the impacts are on systems still using the pre-simplified EM (which
should have been phased out AFAIK), and on the different presentation of the
sysfs topology, which can potentially break userspace deps, right? I think
this is not a problem either, but these could be famous last words as usual :-)
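
To make the sysfs difference concrete, here is a small sketch that decodes
the package_cpus masks quoted above into CPU lists. The helper name
cpus_in_mask is made up for this example; the mask values are the Juno ones
from Dietmar's output.

```python
def cpus_in_mask(hex_mask):
    """Return the CPU numbers whose bits are set in a sysfs hex cpumask."""
    # Larger systems print comma-separated 32-bit groups; drop the commas.
    value = int(hex_mask.replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


for label, mask in (("v5.9 cpu0", "39"), ("v5.9 cpu1", "06"), ("tip cpu0", "3f")):
    print(label, cpus_in_mask(mask))
# v5.9 cpu0 [0, 3, 4, 5]
# v5.9 cpu1 [1, 2]
# tip cpu0 [0, 1, 2, 3, 4, 5]
```

On v5.9 the two masks split the CPUs along the uArch boundary, while on
tip/sched/core every CPU reports the same six-CPU mask, so anything in
userspace that infers clusters from package_cpus would see a single group.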


Thanks

--
Qais Yousef