Re: [PATCH] sched/documentation: elaborate on uclamp limitations

From: Dietmar Eggemann
Date: Tue May 23 2023 - 10:39:34 EST


On 23/05/2023 11:23, Vincent Guittot wrote:
> On Thu, 18 May 2023 at 14:42, Hongyan Xia <hongyan.xia2@xxxxxxx> wrote:
>>
>> Hi Qais,
>>
>> On 2023-05-18 12:30, Qais Yousef wrote:
>>> Please CC sched maintainers (Ingo + Peter) next time as they should pick this
>>> up ultimately and they won't see it from the list only.
>>
>> Will do. I was using the get_maintainers script and I thought that gave
>> me all the CCs.
>>
>>> On 05/05/23 16:24, Hongyan Xia wrote:

[...]

>>>> diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
>>>> index 74d5b7c6431d..524df07bceba 100644
>>>> --- a/Documentation/scheduler/sched-util-clamp.rst
>>>> +++ b/Documentation/scheduler/sched-util-clamp.rst
>>>> @@ -669,6 +669,19 @@ but not proportional to Fmax/Fmin.
>>>>
>>>> p0->util_avg = 300 + small_error
>>>>
>>>> +The reason why util_avg is around 300 even though it runs for 900 at Fmin is:
>
> What does it mean running for 900 at Fmin ? util_avg is a ratio in the
> range [0:1024] without time unit
>
>>>> +Although running at Fmin reduces the rate of rq_clock_pelt() to 1/3 thus
>>>> +accumulates util_sum at 1/3 of the rate at Fmax, the clock period
>>>> +(rq_clock_pelt() now minus previous rq_clock_pelt()) in:
>>>> +
>>>> +::
>>>> +
>>>> + util_sum / clock period = util_avg
>
> I don't get the meaning of the formula above ? There is no "clock
> period" (although I'm not sure what it means here) involved when
> computing util_avg

I also didn't get this one. IMHO. the relation between util_avg and
util_sum is `divider = LOAD_AVG_MAX - 1024 + avg->period_contrib`. But
I can't see how this matters here.

The crucial point here is IMHO as long we have idle time (p->util_avg <
CPU (current) capacity) the util_avg will not raise to 1024 since at
wakeup util_avg will be only decayed (since the task was sleeping, i.e.
!!se->on_rq = 0). And we are scale invariant thanks to the functionality
in update_rq_clock_pelt() (which is executed when p is running).

The pelt clock update at this moment (wakeup) is setting clock_pelt to
clock_task since rq->curr is the idle task but IMHO that is not the
reason why p->util_avg behaves like this.

The moment `p->util_avg >= CPU (current) capacity` there is no idle time
left, i.e. no such `only decay` updates happens for p anymore (only
`accrue/decay` updates in tick) and the result is that p->util_avg goes
1024.

> Also, there is no linear relation between util_avg and Fmin/Fmax
> ratio. Fmin/Fmax ratio is meaningful in regards to the ratio between
> running time and period time of a periodic task. I understand the
> reference of pelt in this document as a quite simplified description
> of PELT so I'm not sure that adding a partial explanation will help.
> It will probably cause more confusion to people. The only thing that
> is sure, is that PELT expects some idle time to stay fully invariant
> for periodic task

+1 ... we have to be able to understand the code. BTW, schedutil.rst has
also paragraphs about PELT and `Frequency / CPU Invariance` and also
refers to kernel/sched/pelt.h:update_rq_clock_pelt() for details.

[...]