Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when updating misfit

From: Qais Yousef
Date: Thu Jan 25 2024 - 20:46:12 EST


On 01/25/24 18:40, Vincent Guittot wrote:
> On Wed, 24 Jan 2024 at 23:30, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> >
> > On 01/23/24 09:26, Vincent Guittot wrote:
> > > On Fri, 5 Jan 2024 at 23:20, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> > > >
> > > > From: Qais Yousef <qais.yousef@xxxxxxx>
> > > >
> > > > If a misfit task is affined to a subset of the possible cpus, we need to
> > > > verify that one of these cpus can fit it. Otherwise the load balancer
> > > > code will continuously trigger needlessly leading the balance_interval
> > > > to increase in return and eventually end up with a situation where real
> > > > imbalances take a long time to address because of this impossible
> > > > imbalance situation.
> > >
> > > If your problem is about increasing balance_interval, it would be
> > > > better to not increase the interval in such a case.
> > > I mean that we are able to detect misfit_task conditions for the
> > > periodic load balance so we should be able to not increase the
> > > interval in such cases.
> > >
> > > If I'm not wrong, your problem only happens when the system is
> > > > overutilized and we have disabled EAS
> >
> > Yes and no. There are two concerns here:
> >
> > 1.
> >
> > So this patch is a generalized form of 0ae78eec8aa6 ("sched/eas: Don't update
> > misfit status if the task is pinned") which is when I originally noticed the
> > > problem and this patch was written alongside it.
> >
> > We have unlinked misfit from overutilized since then.
> >
> > And to be honest I am not sure if flattening of topology matters too since
> > I first noticed this, which was on Juno which doesn't have flat topology.
> >
> > > FWIW I can still reproduce this, but I have a different setup now. On an M1 Mac
> > > mini, if I spawn a busy task affined to the littles and then expand the mask to
> > > include a single big core, I see big delays (>500ms) without the patch. With the
> > > patch it moves in a few ms. The delay without the patch is too large and I can't
> > > explain it. So the worry here is that, in general, misfit migration is not
> > > happening fast enough due to these fake misfit cases.
>
> I tried a similar scenario on RB5 but I don't see any difference with
> your patch. And that could be me not testing it correctly...
>
> I set the affinity of always running task to cpu[0-3] for a few
> seconds then extend it to [0-3,7] and the time to migrate is almost
> the same.

That matches what I do.

I write a trace_marker when I change affinity to help see when it should move.
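
A minimal sketch of a reproducer along these lines, in Python for brevity. It
is an illustration only, not the script used for the numbers in this thread.
The cpulists ("0-3" for the littles, "7" for the big core), the hold time, the
helper names and the tracefs mount point are all assumptions to adjust per
machine.

```python
import multiprocessing
import os
import time

# Assumed tracefs mount point; writing to it normally requires root.
TRACE_MARKER = "/sys/kernel/tracing/trace_marker"


def parse_cpu_list(spec):
    """Turn a cpulist string such as "0-3,7" into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def mark(msg):
    """Write a marker into the trace buffer; silently skip if not permitted."""
    try:
        with open(TRACE_MARKER, "w") as f:
            f.write(msg + "\n")
    except OSError:
        pass


def busy_loop():
    """An always-running task."""
    while True:
        pass


def run(little="0-3", big="7", hold_s=5.0):
    """Pin a busy task to the littles, then widen the mask to add a big CPU."""
    worker = multiprocessing.Process(target=busy_loop, daemon=True)
    worker.start()
    os.sched_setaffinity(worker.pid, parse_cpu_list(little))
    mark("affinity: littles only")
    time.sleep(hold_s)
    os.sched_setaffinity(worker.pid, parse_cpu_list(little) | parse_cpu_list(big))
    mark("affinity: expanded to include big")  # migration is expected after this
    time.sleep(hold_s)
    worker.terminate()
```

Calling run() while tracing sched_switch/sched_migrate_task lets the migration
delay be read off as the gap between the second marker and the first time the
task runs on the big CPU.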

>
> I'm using tip/sched/core + [0]
>
> [0] https://lore.kernel.org/all/20240108134843.429769-1-vincent.guittot@xxxxxxxxxx/

I tried on a Pinebook Pro, which has an rk3399, and I can't reproduce it there
either.

On the M1 I get two sched domains, MC and DIE, but the pine64 has only MC.
Could this be the difference, given that the load balancer depends on the sched
domains?

It seems we flatten topologies but not sched domains. I see all cpus shown as
core_siblings. The DT for Apple silicon sets clusters in the cpu-map, and it
seems the topology flattening code detects the LLC correctly but still leaves
the sched domains unflattened. Is this a bug? I thought we would still end up
with one sched domain.
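
For comparing the two boards, a small sketch that lists the sched domain names
per cpu. It assumes debugfs is mounted at /sys/kernel/debug, that the kernel
exposes the scheduler debug directory there, and that it is run as root; the
function name is hypothetical.

```python
import glob
import os

# Assumed location of the scheduler's debugfs domain directory.
SCHED_DEBUGFS = "/sys/kernel/debug/sched/domains"


def domain_names(cpu, base=SCHED_DEBUGFS):
    """Return the sched domain names for one cpu, lowest level first."""
    paths = glob.glob(os.path.join(base, f"cpu{cpu}", "domain*", "name"))

    def level(path):
        # ".../domain3/name" -> 3, so domain10 sorts after domain2.
        return int(os.path.basename(os.path.dirname(path))[len("domain"):])

    names = []
    for path in sorted(paths, key=level):
        with open(path) as f:
            names.append(f.read().strip())
    return names
```

Going by the description above, domain_names(0) would give ['MC', 'DIE'] on the
M1 and ['MC'] on the rk3399. An empty list means the directory was not found or
not readable.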

TBH I had a bit of confirmation bias that this is a problem based on the fix
(0ae78eec8aa6) that we had in the past. So for verification I looked at
balance_interval and at this reproducer, which is not the same as the original
one. It might be exposing another problem, and I didn't think twice about it.

The patch did help though. So maybe there is more than one problem. The delays
are longer than I expected, as I tried to highlight. I'll continue to probe.