Re: [PATCH 2/2] cpufreq: Update CPU capacity reduction in store_scaling_max_freq()

From: Lukasz Luba
Date: Tue Oct 11 2022 - 06:25:56 EST

In reply to: Peter Zijlstra: "Re: [PATCH 2/2] cpufreq: Update CPU capacity reduction in store_scaling_max_freq()"

On 10/11/22 09:38, Peter Zijlstra wrote:

On Mon, Oct 10, 2022 at 11:46:29AM +0100, Lukasz Luba wrote:

+CC Daniel, since I have mentioned DTPM a few times

On 10/10/22 11:25, Peter Zijlstra wrote:

On Mon, Oct 10, 2022 at 11:12:06AM +0100, Lukasz Luba wrote:

BTW, those Android user space max freq requests are not that long,
mostly due to camera capturing (you can see a few in this file,
e.g. [1]).

It does what now ?!? Why is Android using this *at*all* ?

It tries to balance the power budget before bad things happen
randomly (throttling different devices without good context about
what's going on). Please keep in mind that we have a ~3 Watt total
power budget in a phone, while several devices might suddenly be used:
1. big CPU with max power ~3-3.5 Watts (and we have 2 cores on Pixel 6)
2. GPU with max power ~6 Watts (normally ~1-2 Watts when lightly used)
3. ISP (Image Signal Processor) up to ~2 Watts
4. DSP also up to 1-2 Watts
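
To put the numbers above side by side, here is a minimal Python sketch that sums the quoted peak figures and compares them with the ~3 Watt budget. The names are made up for illustration, the upper bound is used wherever a range is given, and the CPU figure is assumed to cover the big cores as a whole (the text does not say whether it is per core).

```python
# Peak figures quoted in the list above (upper bound where a range is given).
# Assumption: the big CPU value is treated as one entry for the big cores.
BUDGET_W = 3.0
MAX_POWER_W = {
    "big CPU": 3.5,
    "GPU": 6.0,
    "ISP": 2.0,
    "DSP": 2.0,
}


def oversubscription(budget_w, max_power_w):
    """Return (sum of peak powers, ratio of that sum to the budget)."""
    total = sum(max_power_w.values())
    return total, total / budget_w


total_w, ratio = oversubscription(BUDGET_W, MAX_POWER_W)
print(f"worst case {total_w:.1f} W vs budget {BUDGET_W:.1f} W ({ratio:.1f}x)")
```

With these assumed inputs the combined peaks are several times the budget, which is why something has to decide ahead of time which device gets capped.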

We don't currently have a good mechanism which is aware
of the total power/thermal budget and the relations between those
devices. Vendors and OEMs run experiments on devices and profile
them to work more predictably in those 'important to users' scenarios.

AFAIK Daniel Lezcano is trying to help with this via a new interface
for PowerCap: DTPM. It might be used as a new interface for those known
scenarios like the camera snapshot. But that interface is on the list
that I have also mentioned - it's missing a mechanism to notify the
scheduler about capacity reduced by a new user-space scenario.

DTPM is like IPA but including random devices? Because I thought IPA
already did lots of this.

DTPM is a kernel interface for the power split, which happens in the
user space policy. It exposes sysfs to set those scenarios (like those
Android 'powerhints'), even before the power/thermal issue occurs. I
have been reviewing it (and advocating for it internally). There is
still more work to do there and AFAIK it is not yet used by Android.

IPA contains the policy for the power budget split, but misses this
'context' of what's going on and what is about to happen. It has some
PID mechanism to correct itself, but it's not a silver bullet.

Furthermore, there are other IPA fundamental issues:
1. You might recall that last year we added the CPU runqueue
utilization signal to IPA. That model still has issues with input
power estimation and I have described that here [1].
2. CPU frequency sampling issue (we assume a constant frequency over
the whole period) (also in [1])
3. Power consumption of a CPU at the same frequency varies and depends
on the workload's instruction mix, e.g. heavy SIMD floating-point code
for some image filter in a camera app drains more power than code
in a garbage-collector background thread which traverses a graph
in memory and has big backend stalls due to the randomness of pointers
(or a game thread doing collision detection on octrees).
Our Energy Model doesn't cover such things (yet).
The issue has become more severe for us with the big cores made
available last year: a new uArch generation, Cortex-X1. They are able
to drain 3.5W instantly, while in the Energy Model we have 2.2W for max
freq. With previous big cores we didn't have such power-hungry CPUs.
A fair assumption was 1.0W for the EM value and 1.7W for peak power
in some SIMD code. That 3.5W-2.2W difference can heat up the SoC really
quickly and easily use up the free thermal budget. So hints from
user space are welcome IMO.
4. User space restrictions on cpufreq and devfreq, which are those
'powerhints' about scenarios possibly coming soon, are not taken into
account, due to a missing interface. I mentioned it ~2 years ago
and sent an RFC example patch for devfreq (didn't dare to address
cpufreq at once) [2]
5. The thermal-pressure PELT signal converges slowly to the original
instant signal set by the thermal governor, so capacity_of()
is delayed in 'observing' the reality of the capped CPUs. In those
user space scenarios short hints are important. I have tried to
add a mechanism to react faster, since we might already have
delays in our FW or IPA relative to the original signal. The patch
didn't make any progress on LKML [3].
6. The leakage. Temperature rising above normal values causes higher
power drain by the CPU core. Presented at LPC 2022 [4]. This is an
issue when our GPU or ISP heats up the SoC, and thus the CPUs.
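
To illustrate the slow convergence from point 5, here is a rough Python sketch of a PELT-like geometric average following a step in the instantaneous thermal pressure. It is only a floating-point model, not the kernel code: it assumes 1 ms update periods and a 32 ms half-life (i.e. no extra thermal decay shift), and the function names are made up for this example.

```python
# Assumptions (not taken from the message above): 1 ms update periods and a
# 32 ms half-life, so the decay factor Y satisfies Y**32 == 0.5.
HALF_LIFE_MS = 32
Y = 0.5 ** (1.0 / HALF_LIFE_MS)


def pelt_like_average(instant, periods_ms, start=0.0):
    """Average after `periods_ms` updates with a constant instant signal."""
    avg = start
    for _ in range(periods_ms):
        # Decay the old contribution, blend in the new sample.
        avg = avg * Y + instant * (1.0 - Y)
    return avg


def ms_to_reach(fraction):
    """Number of 1 ms updates until the average reaches `fraction` of a unit step."""
    avg = 0.0
    n = 0
    while avg < fraction:
        avg = avg * Y + (1.0 - Y)
        n += 1
    return n


print(f"after {HALF_LIFE_MS} ms: {pelt_like_average(1.0, HALF_LIFE_MS):.2f} of the step")
print(f"ms to reach 90% of the step: {ms_to_reach(0.9)}")
```

Under these assumptions the average has only seen half of the cap after one half-life and needs on the order of a hundred milliseconds to get close to it, so a short user-space hint can expire before capacity_of() reflects it.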

If you like, I can give you more details on how those different CPUs
(and other devices) behave under power/thermal stress in various
scenarios. I have spent a lot of time in the last ~5 years researching
it.

Regards,
Lukasz

[1] https://lore.kernel.org/linux-pm/20220406220809.22555-1-lukasz.luba@xxxxxxx/
[2] https://lore.kernel.org/lkml/20210126104001.20361-1-lukasz.luba@xxxxxxx/
[3] https://lore.kernel.org/lkml/20220429091245.12423-1-lukasz.luba@xxxxxxx/
[4] https://lpc.events/event/16/contributions/1341/
