Re: [PATCH 3/4] sched/fair: introduce core_vruntime and core_min_vruntime

From: cruzzhao
Date: Thu Nov 16 2023 - 21:48:48 EST




On 2023/11/15 11:22 PM, Peter Zijlstra wrote:
> On Wed, Nov 15, 2023 at 09:42:13PM +0800, cruzzhao wrote:
>>
>>
>> On 2023/11/15 8:20 PM, Peter Zijlstra wrote:
>>> On Wed, Nov 15, 2023 at 07:33:40PM +0800, Cruz Zhao wrote:
>>>> To compare the priority of sched_entity from different cpus of a core,
>>>> we introduce core_vruntime to struct sched_entity and core_min_vruntime
>>>> to struct cfs_rq.
>>>>
>>>> cfs_rq->core->core_min_vruntime records the min vruntime of the cfs_rqs
>>>> of the same task_group among the core, and se->core_vruntime is the
>>>> vruntime relative to se->cfs_rq->core->core_min_vruntime.
>>>
>>> But that makes absolutely no sense. vruntime of different RQs can
>>> advance at wildly different rates. Not to mention there's this random
>>> offset between them.
>>>
>>> No, this cannot be.
>>
>> Force idle vruntime snapshot does the same thing, comparing
>> sea->vruntime - cfs_rqa->min_vruntime_fi with seb->vruntime -
>> cfs_rqb->min_vruntime_fi, while sea and seb may have wildly different rates.
>
> But that subtracts a from a and b from b, it doesn't mix a and b.
>
> Note that se->vruntime - cfs_rq->min_vruntime is a very poor
> approximation of lag. We have actual lag now.
>
> Note that:
>
> (sea - seb) + (min_fib - min_fia) =
> (sea - min_fia) + (min_fib - seb) =
> (sea - min_fia) - (seb - min_fib) =
> 'lag'a - 'lag'b
>
> It doesn't mix absolute a and b terms anywhere.
>
>> Actually, cfs_rq->core->core_min_vruntime does the same thing as
>> cfs_rq->min_vruntime_fi, providing a baseline, but
>> cfs_rq->core->core_min_vruntime is more accurate.
>
> min(cfs_rqa, cfs_rqb) is nonsense. And I can't see how min_vruntime_fi
> would do anything like that.
>

Introducing core_vruntime and core_min_vruntime is an attempt to maintain
a single core-wide cfs_rq by abstracting vruntime, and core_min_vruntime
is not equal to min(cfs_rqa, cfs_rqb).

Note that:
sea->core_vruntime - seb->core_vruntime =
(sea->core_vruntime - seb->core_vruntime) +
(core_min_vruntime - core_min_vruntime) =
(sea->core_vruntime - core_min_vruntime) -
(seb->core_vruntime - core_min_vruntime) =
'lag'a - 'lag'b
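
The identity above can be checked with a minimal userspace sketch. This is
not kernel code: rel_vruntime(), core_cmp(), fi_cmp() and
check_baseline_cancels() are hypothetical helpers written only for
illustration, and the u64/s64 typedefs stand in for the kernel types. It
assumes both entities are measured against one shared core_min_vruntime,
and contrasts that with the snapshot form, where each entity is measured
against its own min_vruntime_fi.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

/*
 * Hypothetical helper: signed distance of a vruntime from a baseline.
 * Casting the unsigned difference to s64 stays correct across u64
 * wraparound as long as the two values are less than 2^63 apart.
 */
static s64 rel_vruntime(u64 vruntime, u64 base)
{
	return (s64)(vruntime - base);
}

/* Core-wide form: both entities are relative to the same baseline. */
static s64 core_cmp(u64 a_core_vruntime, u64 b_core_vruntime,
		    u64 core_min_vruntime)
{
	return rel_vruntime(a_core_vruntime, core_min_vruntime) -
	       rel_vruntime(b_core_vruntime, core_min_vruntime);
}

/* Snapshot form: each entity is relative to its own rq's baseline. */
static s64 fi_cmp(u64 a_vruntime, u64 a_min_fi,
		  u64 b_vruntime, u64 b_min_fi)
{
	return rel_vruntime(a_vruntime, a_min_fi) -
	       rel_vruntime(b_vruntime, b_min_fi);
}

/* The shared baseline cancels, so its value does not affect the result. */
static void check_baseline_cancels(u64 a, u64 b, u64 base1, u64 base2)
{
	assert(core_cmp(a, b, base1) == core_cmp(a, b, base2));
}
```

With a shared baseline the result depends only on the two core_vruntime
values, so the comparison is only as meaningful as core_vruntime itself.
In the snapshot form the two baselines differ and do not cancel.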

The problem of wildly different vruntime rates also exists with the
vruntime snapshot. Consider a core that always force-idles one of its SMT
siblings, so that min_vruntime_fi is never updated. In this case, 'lag'a
and 'lag'b grow according to the entities' respective weights within
their own cfs_rq, instead of according to core-wide weights.
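
A rough numeric sketch of that scenario follows, under simplifying
assumptions. vruntime_delta() is a hypothetical stand-in for the kernel's
calc_delta_fair() that ignores its fixed-point inverse-weight arithmetic,
NICE_0_LOAD is taken as an unscaled 1024, and struct fi_view, run_for()
and fi_relative() are illustration-only names. The snapshot
min_vruntime_fi is held frozen to model the "never updated" case.

```c
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

#define NICE_0_LOAD 1024ULL /* assumption: unscaled nice-0 weight */

/* Simplified weighting: vruntime advances more slowly for heavier entities. */
static u64 vruntime_delta(u64 delta_exec_ns, u64 weight)
{
	return delta_exec_ns * NICE_0_LOAD / weight;
}

/* Hypothetical per-entity view; fields mirror the names used in the thread. */
struct fi_view {
	u64 vruntime;        /* se->vruntime */
	u64 min_vruntime_fi; /* snapshot baseline, frozen in this scenario */
	u64 weight;          /* load weight of the entity */
};

/* Account delta_exec_ns of runtime without refreshing the snapshot. */
static void run_for(struct fi_view *v, u64 delta_exec_ns)
{
	v->vruntime += vruntime_delta(delta_exec_ns, v->weight);
}

/* The 'lag' approximation discussed above: vruntime minus the snapshot. */
static s64 fi_relative(const struct fi_view *v)
{
	return (s64)(v->vruntime - v->min_vruntime_fi);
}
```

For example, if an entity of weight 1024 and an entity of weight 2048 on
different runqueues each run for the same wall-clock time, their
fi_relative() values grow at a 2:1 ratio. That ratio is set by each
entity's own weight, and nothing in the frozen snapshot rescales it to a
core-wide view.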

As far as I can tell, there is no perfect solution or mechanism for this
problem yet, but I'll keep trying.

Best,
Cruz Zhao