Re: [PATCH 3/4] sched/fair: introduce core_vruntime and core_min_vruntime

From: Peter Zijlstra
Date: Wed Nov 15 2023 - 10:23:24 EST


On Wed, Nov 15, 2023 at 09:42:13PM +0800, cruzzhao wrote:
>
>
> On 2023/11/15 8:20 PM, Peter Zijlstra wrote:
> > On Wed, Nov 15, 2023 at 07:33:40PM +0800, Cruz Zhao wrote:
> >> To compare the priority of sched_entity from different cpus of a core,
> >> we introduce core_vruntime to struct sched_entity and core_min_vruntime
> >> to struct cfs_rq.
> >>
> >> cfs_rq->core->core_min_vruntime records the min vruntime of the cfs_rqs
> >> of the same task_group among the core, and se->core_vruntime is the
> >> vruntime relative to se->cfs_rq->core->core_min_vruntime.
> >
> > But that makes absolutely no sense. vruntime of different RQs can
> > advance at wildly different rates. Not to mention there's this random
> > offset between them.
> >
> > No, this cannot be.
>
> Force idle vruntime snapshot does the same thing, comparing
> sea->vruntime - cfs_rqa->min_vruntime_fi with seb->vruntime -
> cfs_rqb->min_vruntime_fi, while sea and seb may have wildly different rates.

But that subtracts a from a and b from b, it doesn't mix a and b.

Note that se->vruntime - cfs_rq->min_vruntime is a very poor
approximation of lag. We have actual lag now.

Note that:

(sea - seb) + (min_fib - min_fia) =
(sea - min_fia) + (min_fib - seb) =
(sea - min_fia) - (seb - min_fib) =
'lag'a - 'lag'b

It doesn't mix absolute a and b terms anywhere.
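
To make that concrete, here is a minimal userspace sketch of the argument above. It is not kernel code: the toy_* names and struct layouts are hypothetical stand-ins, and the comparison is only loosely modeled on the force-idle snapshot compare discussed in this thread. It shows that when each entity is measured against its own runqueue's baseline, moving one runqueue's arbitrary clock origin leaves the comparison unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for per-runqueue state; not the kernel's struct cfs_rq. */
struct toy_cfs_rq {
	uint64_t min_vruntime_fi;	/* snapshot baseline of this runqueue */
};

/* Hypothetical stand-in for a scheduling entity. */
struct toy_se {
	uint64_t vruntime;		/* absolute vruntime on its own runqueue */
	const struct toy_cfs_rq *cfs_rq;
};

/* 'lag'-like term: the entity relative to its OWN runqueue's baseline. */
static int64_t toy_rel(const struct toy_se *se)
{
	return (int64_t)(se->vruntime - se->cfs_rq->min_vruntime_fi);
}

/* 'lag'a - 'lag'b: a from a, b from b, no absolute a/b terms mixed. */
static int64_t toy_cmp(const struct toy_se *a, const struct toy_se *b)
{
	return toy_rel(a) - toy_rel(b);
}

/* Move one runqueue's clock origin; baseline and entity shift together. */
static void toy_shift(struct toy_cfs_rq *rq, struct toy_se *se, uint64_t off)
{
	rq->min_vruntime_fi += off;
	se->vruntime += off;
}

/* Usage: returns 0 after checking the comparison survives an origin shift. */
static int toy_demo(void)
{
	struct toy_cfs_rq rqa = { .min_vruntime_fi = 100 };
	struct toy_cfs_rq rqb = { .min_vruntime_fi = 5000 };
	struct toy_se sea = { .vruntime = 130, .cfs_rq = &rqa };
	struct toy_se seb = { .vruntime = 5010, .cfs_rq = &rqb };
	int64_t before = toy_cmp(&sea, &seb);

	toy_shift(&rqa, &sea, 1000000);	/* the random offset between queues */
	assert(toy_cmp(&sea, &seb) == before);
	return 0;
}
```

Something like min(sea->vruntime, seb->vruntime) would not survive the same toy_shift(), since its result depends on where each runqueue's origin happens to sit.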

> Actually, cfs_rq->core->core_min_vruntime does the same thing as
> cfs_rq->min_vruntime_fi, providing a baseline, but
> cfs_rq->core->core_min_vruntime is more accurate.

min(cfs_rqa, cfs_rqb) is nonsense. And I can't see how min_vruntime_fi
would do anything like that.

> I've tried to implement a fair enough mechanism of core_vruntime, but
> it's too complex because of the weight, and it costs a lot. So this is a
> compromise solution.

'this' is complete nonsense and not motivated by any math.

> BTW, are there any other solutions to solve this problem?

Well, this is where it all started:

https://lkml.kernel.org/r/20200506143506.GH5298%40hirez.programming.kicks-ass.net

The above lag thing is detailed in a follow up:

https://lkml.kernel.org/r/20200515103844.GG2978%40hirez.programming.kicks-ass.net

Anyway, I think the first of those links has the start of the
multi-queue formalism, see the S_k+l bits. Work that out and see where
it ends.

I did go a bit further, but I've forgotten everything since; it's been
years.

Anyway, nothing like this goes in without a fairly solid bit of math in
the changelog to justify it.

Also, I think Joel complained about something like this at some point,
and he wanted to update the core tree more often, because IIRC his
observation was that things got stale or something.