Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

From: Dietmar Eggemann
Date: Mon Jun 05 2023 - 07:29:45 EST


On 31/05/2023 20:22, Qais Yousef wrote:
> Hi Lukasz!
>
> Sorry for late response..
>
> On 05/22/23 09:30, Lukasz Luba wrote:
>> Hi Qais,
>>
>> I have a question regarding the 'soft cpu affinity'.
>
> [...]
>
>>> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
>>> is two folds:
>>>
>>> 1. Prevent tasks from consuming energy.
>>> 2. Keep them away from expensive CPUs.
>>>
>>> 2 is actually very important for 2 reasons:
>>>
>>> a. Because of max aggregation - any uncapped tasks that wakes up will
>>> cause a frequency spike on this 'expensive' cpu. We don't have
>>> a mechanism to downmigrate it - which is another thing I'm working
>>> on.
>>> b. It is desired to keep these bigger cpu idle ready for more important
>>> work.
>>>
>>> For 2, generally we don't want these tasks to steal bandwidth from these CPUs
>>> that we'd like to preserve for other type of work.
>>
>> I'm a bit afraid about such 'strong force'. That means the task would
>> not go via EAS if we set uclamp_max e.g. 90, while the little capacity
>> is 125. Or am I missing something?
>
> We should go via EAS, actually that's the whole point.
>
> Why do you think we won't go via EAS? The logic should be is we give a hint to
> prefer the little core, but we still can pick something else if it's more
> energy efficient.
>
> What uclamp_max enables us is to still consider that little core even if it's
> utilization says it doesn't fit there. We need to merge these patches first
> though as it's broken at the moment. if little capacity is 125 and utilization
> of the task is 125, then even if uclamp_max is 0, EAS will skip the whole
> little cluster as apotential candidate because there's no spare_capacity there.
> Even if the whole little cluster is idle.

I'm not against letting uclamp_max force fit the placement of p. I'm
just worried that using the EM (especially in it's current state) for
that is wrong and will only work in certain scenarios like the one you
picked.

I did show a counter-example (Juno-r0 [446 1024 1024 446 446 446] with 6
8ms/16ms uclamp_max=440). The issue is that once we have no idle time
left, the decisions of the EM are simply wrong and we shouldn't enforce
those decisions.

There is a related issue. E.g. on Pixel6 with its 4 little CPUs with
cpu_cap_orig = 124. If you admit a task with p->util_avg > cpu_cap_orig
it doesn't even fit onto a CPU. But we then do an energy calculation
taking the values of the whole little PD into consideration. This makes
no sense. The EM is not uclamp_max aware right now.

What about using `sis() -> sic()` for uclamp_max constrained tasks? We
would just have to iterate over the CPUs in cpu_cap_orig order. (1)

Or the EM has to be made uclamp_max aware. (2)

Your example is the idle system with 1 task p waking up. This task has a
util_avg which excludes the little CPUs from being the new task_cpu(p).
But p's uclamp_max value would allow this. But we can't just consider
the placement of 1 task here.

>> This might effectively use more energy for those tasks which can run on
>> any CPU and EAS would figure a good energy placement. I'm worried
>> about this, since we have L3+littles in one DVFS domain and the L3
>> would be only bigger in future.
>
> It's a bias that will enable the search algorithm in EAS to still consider the
> little core for big tasks. This bias will depend on the uclamp_max value chosen
> by userspace (so they have some control on how hard to cap the task), and what
> else is happening in the system at the time it wakes up.

To teach the EM about such tricky dependencies is IMHO outside the scope
of `how to select a CPU for a uclamp_max constrained task`. (3)

>> IMO to keep the big cpus more in idle, we should give them big energy
>> wake up cost. That's my 3rd feature to the EM presented in OSPM2023.
>
> Considering the wake up cost in EAS would be a great addition to have :)

I see this one as unrelated to (3) as well.

>>> Of course userspace has control by selecting the right uclamp_max value. They
>>> can increase it to allow a spill to next pd - or keep it low to steer them more
>>> strongly on a specific pd.
>>
>> This would we be interesting to see in practice. I think we need such
>> experiment, for such changes.
>
> I'm not sure what you mean by 'such changes'. I hope you don't mean these
> patches as they are not the key. They fix an obvious bug where task placement
> hint won't work at all. They don't modify any behavior that shouldn't have
> already been there. Nor introduce new limitation. I have to say I am
> disappointed that these patches aren't considered an important fix for an
> obvious breakage.

To me it's a dead-end to go this way. We need to see the full picture
including something like (1) or (2) or patches you have mentioned, like
the `down-migration in load-balance` thing.

Maybe we can at least list all the use cases for uclamp_max capping here:

It was mentioned:

(A) `soft affinity for tasks w/ util_avg > uclamp_max`.

Are there more?