Re: [PATCH 3/4] sched/schedutil: Ignore update requests for short running tasks

From: Hongyan Xia
Date: Mon Dec 11 2023 - 06:22:53 EST


On 10/12/2023 22:22, Qais Yousef wrote:
Hi Hongyan

On 12/08/23 10:42, Hongyan Xia wrote:

What is a big concern is a normal task and a uclamp_max task running on the
same rq. If the uclamp_max task is 1024 but capped by uclamp_max at the
lowest OPP, and the normal task has no uclamp but a duty cycle, then when

You mean util_avg is 1024 but capped to lowest OPP? uclamp_max is repeated but
couldn't decipher what you meant to write instead.

the normal task wakes up on the rq, it'll be the highest OPP. When it
sleeps, the uclamp_max is back and at the lowest OPP. This square-wave
problem to me is a much bigger concern than an infrequent spike. If
CONFIG_HZ is 1000, this square wave's frequency is 500 switching between

If the rq->util_avg is 1024, then for any task that is running, the requested
frequency should be max. If there's a task that is capped by uclamp_max, then
this task should not run at max.

So other tasks should have run at max; why don't you want them to run at max?

Because it saves power. If there's a 1024 task capped at 300 and a true 300 task without uclamp on the same rq, there's no need to run the CPU at more than 600. Running it at 1024 ignores the uclamp_max and burns battery when it's not needed.
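
To put numbers on that, here is a rough sketch in Python. It is a toy model, not kernel code: the function names and the dict layout are made up for illustration, and it ignores uclamp_min and any frequency headroom. It compares what max aggregation requests against a per-task clamped sum:

```python
# Toy model of the example above. All names are hypothetical.
SCHED_CAPACITY_SCALE = 1024


def max_aggregation_request(tasks):
    """Clamp the summed rq util with the largest uclamp_max among runnable tasks."""
    rq_util = min(sum(t["util"] for t in tasks), SCHED_CAPACITY_SCALE)
    rq_uclamp_max = max(t["uclamp_max"] for t in tasks)
    return min(rq_util, rq_uclamp_max)


def per_task_clamped_sum(tasks):
    """Clamp each task's util by its own uclamp_max first, then sum."""
    total = sum(min(t["util"], t["uclamp_max"]) for t in tasks)
    return min(total, SCHED_CAPACITY_SCALE)


capped = {"util": 1024, "uclamp_max": 300}   # 1024 task capped at 300
normal = {"util": 300, "uclamp_max": 1024}   # true 300 task, default uclamp

print(max_aggregation_request([capped, normal]))  # 1024
print(per_task_clamped_sum([capped, normal]))     # 600
print(max_aggregation_request([capped]))          # 300
```

With both tasks runnable, the uncapped task lifts the rq-level clamp to 1024, so the cap on the first task no longer has any effect on the request.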

highest and lowest OPP, which is definitely unacceptable.

How come so definitive? How did you reach this definitive conclusion?

You are right. I talked to our firmware and silicon engineers, and they don't think switching between the highest and lowest OPP 500 times a second can have damaging effects, so I retract the 'unacceptable' comment.

The problem I think with filtering is, under this condition, should we
filter out the lowest OPP or the highest? Neither sounds like a good answer
because neither is a short-running task and the correct answer might be
somewhere in between.

We only ignore uclamp requirement with the filter. schedutil is driven by the rq
utilization signal as normal. The only question is whether we should apply
uclamp_min/max or not.

It seems you think we need to modify the rq->util_avg. And this should not be
the case. uclamp should not change how PELT accounting works; just modify how
some decisions based on it are taken.

I agree, uclamp shouldn't change PELT, but my series doesn't. Just like util_est which boosts util_avg, my patches don't touch util_avg but simply introduce util_min and util_max on the side of util_avg. I fail to see why it's okay for util_est to bias util_avg but not okay for me to do so. If this is the case, then the 'util_guest' proposal should also be rejected outright on the same grounds.

It is true there's a corner case where util_avg could be wrong under the
documented limitation. But this is not related to max-aggregation and its
solution needs some creativity in handling pelt accounting under these
conditions.

Generally, capping that hard raises the question of why userspace is doing this. We
don't want to cripple tasks, but prevent them from consuming inefficient energy
points. Otherwise they should make adequate progress. I'd expect uclamp_max to
be more meaningful for tasks that actually need to run at those higher
expensive frequencies.

So the corner case warrants fixing, but it is not a nuisance in practice yet. And
it is a problem of failing to calculate the stolen idle time as we don't have
any under these corner conditions (Vincent can make a more accurate statement
than me here). It has nothing to do with how to handle performance requirements
of multiple RUNNABLE tasks.

Sorry to ramble on this again and again, but I think filtering is addressing
the symptom, not the cause. The cause is we have no idea under what
condition a util_avg was achieved. The 1024 task in the previous example
would be much better if we extend it into

I think the other way around :-) I think you're mixing the root cause of that
limitation with how uclamp hints for multiple tasks should be handled - which
is what is being fixed here.

I wrote the documentation and did the experiments to see how PELT behaves under
extreme conditions. And it says *it can break PELT*.

[1024, achieved at uclamp_min 0, achieved at uclamp_max 300]

Why do you think this is the dominant use case? And why do you think this is
caused by max-aggregation? This is a limitation of PELT accounting and has
nothing to do with max-aggregation which is how multiple uclamp hints for
RUNNABLE tasks are handled.

Have you actually seen it in practice? We haven't come across this problem yet. We
just want to avoid using expensive OPPs, but capping too hard is actually
a problem as it can cause starvation for those tasks.

Is it only the documentation that triggered this concern about this corner
case? I'm curious what you have seen.

This is not a corner case. Having a uclamp_max task and a normal task on the same rq is fairly common. My concern isn't the 'frequency spike' problem from the documentation. My concern comes from benchmarks, which show high-frequency square waves. An occasional spike isn't worrying, but the square waves are.

If we know 1024 was done under uclamp_max of 300, then we know we don't need
to raise to the max OPP. So far, we carry around a lot of different new
variables but not these two which we really need.

This problem is independent of how uclamp hint of multiple tasks should be
accounted for by the governor. This is a limitation of how PELT accounting
works. And to be honest, if you think more about it, this 300 task is already
a 1024 on current littles that have a capacity of 200 or less. And the capacity
of the mids at lowest OPP usually starts at a capacity around 100 or something.
Can't see it hit this problem while running on middle. I think this 300 task
will already run near lowest OPP at the big even without uclamp_max being
0 - it is that small for it.

So not sure on what systems you saw this problem on, and whether at all this is
a problem in practice. Like priority/nice and other sched attributes; you can
pick a wrong combination and shoot yourself in the foot.

As I put in the documentation, this limitation will only hit if the actual task
capacity reaches some magical ratio. I'd expect practically these tasks to
still see idle time and get their util_avg corrected eventually.

As in the previous comment, what worries me is the square waves happening 500 times a second that I saw in benchmarks, not the occasional spike described in the documentation. I doubt we can say that a uclamp_max task and a normal task running on the same rq is a corner case.
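
For reference, the arithmetic behind the 500 figure can be written as a toy Python trace. This is not kernel code. It assumes CONFIG_HZ=1000, a normal task that is runnable on every other tick, and a uclamp_max of 0 standing in for "capped at the lowest OPP"; all names are hypothetical:

```python
# Toy per-tick trace of the frequency request under max aggregation.
CONFIG_HZ = 1000          # ticks per second, as in the example in this thread
CAPPED_UCLAMP_MAX = 0     # assumption: a cap low enough to map to the lowest OPP
MAX_UTIL = 1024


def request(normal_runnable):
    """Request seen by the governor for one tick (simplified, no uclamp_min)."""
    rq_util = MAX_UTIL  # the capped task alone already saturates util_avg
    # The normal task has the default uclamp_max, which lifts the rq clamp.
    rq_uclamp_max = MAX_UTIL if normal_runnable else CAPPED_UCLAMP_MAX
    return min(rq_util, rq_uclamp_max)


# Assumed duty cycle: the normal task is runnable on even ticks only.
trace = [request(tick % 2 == 0) for tick in range(CONFIG_HZ)]

# Each high-to-low transition closes one full cycle of the square wave.
falling_edges = sum(1 for prev, cur in zip(trace, trace[1:]) if cur < prev)

print(trace[:4])        # [1024, 0, 1024, 0]
print(falling_edges)    # 500
```

A longer duty cycle for the normal task lowers the frequency of the wave, but the request still swings between the two extremes.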

So worth a fix, not related to handling performance requests for multiple
tasks, and not urgently needed as nothing is falling apart because of it for
the time being at least.

Also, I think there's still an unanswered question. If there's a 1024 task with a uclamp_max of 300 and a 300-util task with default uclamp on the same rq, which currently under max aggregation switches between highest and lowest OPP, should we filter out the high OPP or the low one? Neither is a short-running task. We could designate this as a corner case (though I don't think it is one), but wouldn't it be nice if we didn't have any of these problems in the first place?

Hongyan