Re: [PATCH] cpufreq: Change default transition delay to 2ms

From: Pierre Gondois
Date: Thu Feb 22 2024 - 10:16:08 EST


Hello Qais,

On 2/22/24 12:55, Qais Yousef wrote:
On 02/20/24 18:38, Pierre Gondois wrote:
Hello Qais,

I added some other remarks,

On 2/20/24 14:50, Qais Yousef wrote:
On 02/14/24 10:19, Pierre Gondois wrote:
Hello,

On 2/12/24 16:53, Rafael J. Wysocki wrote:
On Mon, Feb 5, 2024 at 8:45 AM Viresh Kumar <viresh.kumar@xxxxxxxxxx> wrote:

On 05-02-24, 02:25, Qais Yousef wrote:
10ms is too high for today's hardware, even low end ones. This default
end up being used a lot on Arm machines at least. Pine64, mac mini and
pixel 6 all end up with 10ms rate_limit_us when using schedutil, and
it's too high for all of them.

Change the default to 2ms which should be 'pessimistic' enough for worst
case scenario, but not too high for platforms with fast DVFS hardware.

Signed-off-by: Qais Yousef <qyousef@xxxxxxxxxxx>
---
drivers/cpufreq/cpufreq.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 44db4f59c4cc..8207f7294cb6 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -582,11 +582,11 @@ unsigned int cpufreq_policy_transition_delay_us(struct cpufreq_policy *policy)
* for platforms where transition_latency is in milliseconds, it
* ends up giving unrealistic values.
*
- * Cap the default transition delay to 10 ms, which seems to be
+ * Cap the default transition delay to 2 ms, which seems to be
* a reasonable amount of time after which we should reevaluate
* the frequency.
*/
- return min(latency * LATENCY_MULTIPLIER, (unsigned int)10000);
+ return min(latency * LATENCY_MULTIPLIER, (unsigned int)(2*MSEC_PER_SEC));

Please add spaces around '*'.

Acked-by: Viresh Kumar <viresh.kumar@xxxxxxxxxx>

I've adjusted the whitespace as suggested above and applied the patch
as 5.9 material.

Thanks!


To add some numbers, on a Juno-r2, with latency measured between the frequency
request on the kernel side and the SCP actually making the frequency update.

The SCP is the firmware responsible of making the frequency updates. It receives
the kernel requests and coordinate them/make the actual changes. The SCP also has
a mechanism called 'fast channel' (FC) where the kernel writes the requested
frequency to a memory area shared with the SCP. Every 4ms, the SCP polls/reads
these memory area and make the required modifications.

Latency values (in ms)
Workload:
Idle system, during ~30s
+---------------------------------------+
| | Without FC | With FC |
+-------+---------------+---------------+
| count | 1663 | 1102 |
| mean | 2.92 | 2.10 |
| std | 1.90 | 1.58 |
| min | 0.21 | 0.00 |
| 25% | 1.64 | 0.91 |
| 50% | 2.57 | 1.68 |
| 75% | 3.66 | 2.97 |
| max | 14.37 | 13.50 |
+-------+---------------+---------------+

Latency values (in ms)
Workload:
One 1% task per CPU, period = 32ms. This allows to wake up the CPU
every 32ms and send more requests/give more work to the SCP. Indeed
the SCP is also responsible of idle state transitions.
Test duration ~=30s.
+---------------------------------------+
| | Without FC | With FC |
+-------+---------------+---------------+
| count | 1629 | 1446 |
| mean | 3.23 | 2.31 |
| std | 2.40 | 1.73 |
| min | 0.05 | 0.02 |
| 25% | 1.91 | 0.98 |
| 50% | 2.65 | 2.00 |
| 75% | 3.65 | 3.23 |
| max | 20.56 | 16.73 |
+-------+---------------+---------------+

---

1.
With this patch, platforms like the Juno which:
- don't set a `transition_delay_us`
- have a high `transition_latency` (> 1ms)
can request freq. changes every 2ms.

If a platform has a `transition_latency` > 2ms, this means:
`transition_latency` > `transition_delay_us`
I.e. a second freq. requests might be emitted before the first one
will be completed. On the Juno, this doesn't cause any 'real' issue
as the SCMI/mailbox mechanism works well, but this doesn't seem
correct.
If the util of CPUs is in between OPPs (i.e. freq. changes are often
required), the Juno:
- sends a freq. request
- waits for completion and schedules another task in the meantime
- upon completion, immediately sends a new freq.

I think that the following should be respected/checked:
- `transition_latency` < `transition_delay_us`
(it might also make sense to have, with K being any factor:)
- `transition_latency` * K < `transition_delay_us`

Makes sense. How about this? I am not sure it is better to multiply with
a factor if the platform is already slow. Even the current 1000 multiply factor
is high but this is a territory I am not ready to step into yet.

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 66cef33c4ec7..68a5ba24a5e0 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -576,6 +576,15 @@ unsigned int cpufreq_policy_transition_delay_us(struct cpufreq_policy *policy)

latency = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
if (latency) {
+ unsigned int max_delay_us = 2 * MSEC_PER_SEC;;
+
+ /*
+ * If the platform already has high transition_latency, use it
+ * as-is.
+ */
+ if (latency > max_delay_us)
[1] return min(latency, 10ms);
+ return latency;
+
/*
* For platforms that can change the frequency very fast (< 10
* us), the above formula gives a decent transition delay. But
@@ -586,7 +595,7 @@ unsigned int cpufreq_policy_transition_delay_us(struct cpufreq_policy *policy)
* a reasonable amount of time after which we should reevaluate
* the frequency.
*/
- return min(latency * LATENCY_MULTIPLIER, (unsigned int)(2 * MSEC_PER_SEC));
+ return min(latency * LATENCY_MULTIPLIER, max_delay_us);
}

return LATENCY_MULTIPLIER;


A policy with these values:
- transition_delay_us = 0
- transition_latency = 30ms
would get a transition_delay of 30ms I think.

Maybe it would be better to default to the old value in this case [1].

---

Also a note that on the Pixel6 I have, transition_latency=5ms,
so the platform's policies would end up with transition_delay=5ms




2.
There are references to the 10ms values at other places in the code:

include/linux/cpufreq.h
* ondemand governor will work on any processor with transition latency <= 10ms,

Not sure this one needs updating. Especially with the change above which means
10ms could theoretically happen. But if there are suggestions happy to take
them.

a.
LATENCY_MULTIPLIER introduction:
112124ab0a9f ("[CPUFREQ] ondemand/conservative: sanitize sampling_rate restrictions")

b.
max_transition_latency removal:
ed4676e25463 ("cpufreq: Replace "max_transition_latency" with "dynamic_switching"")

c.
dynamic_switching removal:
9a2a9ebc0a75 ("cpufreq: Introduce governor flags")

The value could be removed independently from this patch indeed, as this is not
related to cpufreq_policy_transition_delay_us() since b.




drivers/cpufreq/cpufreq.c
* For platforms that can change the frequency very fast (< 10
* us), the above formula gives a decent transition delay. But
-> the 10us value matches 10ms = 10us * LATENCY_MULTIPLIER

I can't find this one.

It's in cpufreq_policy_transition_delay_us(),
"the 10us value matches 10ms = 10us * LATENCY_MULTIPLIER"
is a sentence I wrote, the comment to modify would be:
"""
* For platforms that can change the frequency very fast (< 10
* us), the above formula gives a decent transition delay. But
"""



Documentation/admin-guide/pm/cpufreq.rst
Typically, it is set to values of the order of 10000 (10 ms). Its
default value is equal to the value of ``cpuinfo_transition_latency``

I am not sure about this one. It refers to cpuinfo_transition_latency not the
delay and uses a formula to calculate it based on that.

Seems the paragraph needs updating in general to reflect other changes?

aa7519af450d ("cpufreq: Use transition_delay_us for legacy governors as well")

The cpufreq_policy_transition_delay_us() was introduced there and integrates the
10ms value related to this paragraph.

---

I assume that if we keep the 10ms value in the code, it should be ok to let
the comment as is. I'll send a patch to remove the first one as it can be
done independently.




3.
There seems to be a dependency of the conservative/ondemand governors
over the the value returned by `cpufreq_policy_transition_delay_us()`:

drivers/cpufreq/cpufreq_governor.c
dbs_data->sampling_rate = max_t(unsigned int,
CPUFREQ_DBS_MIN_SAMPLING_INTERVAL, // = 2 * tick period = 8ms
cpufreq_policy_transition_delay_us(policy)); // [1]: val <= 2ms

[1]
if `transition_latency` is not set and `transition_delay_us` is,
which is the case for the Juno.

The `sampling_rate` is, FYIU, the period used to evaluate the ratio
of the idle/busy time, and if necessary increase/decrease the freq.

This patch will likely reduce this sampling rate from 10ms -> 8ms
(if `cpufreq_policy_transition_delay_us()`` now returns 2ms for some
platforms). This is not much, but just wanted to note it.

I don't think this is a problem as tick being 1ms is common and
transition_delay_us is not 10ms on all platforms. On my amd system it is 1ms
and on another intel i5 it is 5ms. So it should have already been coping with
various combination.

Ok right,

Regards,
Pierre