Re: [PATCH 4/4] arm64: dts: rockchip: Add OPP data for CPU cores on RK3588

From: Alexey Charkov
Date: Sat Jan 27 2024 - 14:41:27 EST


On Sat, Jan 27, 2024 at 12:33 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
>
> On 2024-01-26 14:44, Alexey Charkov wrote:
> > On Fri, Jan 26, 2024 at 4:56 PM Daniel Lezcano
> > <daniel.lezcano@xxxxxxxxxx> wrote:
> >> On 26/01/2024 08:49, Dragan Simic wrote:
> >> > On 2024-01-26 08:30, Alexey Charkov wrote:
> >> >> On Fri, Jan 26, 2024 at 11:05 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
> >> >>> On 2024-01-26 07:44, Alexey Charkov wrote:
> >> >>> > On Fri, Jan 26, 2024 at 10:32 AM Dragan Simic <dsimic@xxxxxxxxxxx>
> >> >>> > wrote:
> >> >>> >> On 2024-01-25 10:30, Daniel Lezcano wrote:
> >> >>> >> > On 24/01/2024 21:30, Alexey Charkov wrote:
> >> >>> >> >> By default the CPUs on RK3588 start up in a conservative
> >> >>> performance
> >> >>> >> >> mode. Add frequency and voltage mappings to the device tree to
> >> >>> enable
> >>
> >> [ ... ]
> >>
> >> >> Throttling would also lower the voltage at some point, which cools it
> >> >> down much faster!
> >> >
> >> > Of course, but the key is not to cool (and slow down) the CPU cores too
> >> > much, but just enough to stay within the available thermal envelope,
> >> > which is where the same-voltage, lower-frequency OPPs should shine.
> >>
> >> That implies the resulting power is sustainable which I doubt it is
> >> the
> >> case.
> >>
> >> The voltage scaling makes the cooling effect efficient not the
> >> frequency.
> >>
> >> For example:
> >> opp5 = opp(2GHz, 1V) => 2 BogoWatt
> >> opp4 = opp(1.9GHz, 1V) => 1.9 BogoWatt
> >> opp3 = opp(1.8GHz, 0.9V) => 1.458 BogoWatt
> >> [ other states but we focus on these 3 ]
> >>
> >> opp5->opp4 => -5% compute capacity, -5% power, ratio=1
> >> opp4->opp3 => -5% compute capacity, -23.1% power, ratio=21,6
> >>
> >> opp5->opp3 => -10% compute capacity, -27.1% power, ratio=36.9
> >>
> >> In burst operation (no thermal throttling), opp4 is pointless we agree
> >> on that.
> >>
> >> IMO the following will happen: in burst operation with thermal
> >> throttling we hit the trip point and then the step wise governor
> >> reduces
> >> opp5 -> opp4. We have slight power reduction but the temperature does
> >> not decrease, so at the next iteration, it is throttle at opp3. And at
> >> the end we have opp4 <-> opp3 back and forth instead of opp5 <-> opp3.
> >>
> >> It is probable we end up with an equivalent frequency average (or
> >> compute capacity avg).
> >>
> >> opp4 <-> opp3 (longer duration in states, less transitions)
> >> opp5 <-> opp3 (shorter duration in states, more transitions)
> >>
> >> Some platforms had their higher OPPs with the same voltage and they
> >> failed to cool down the CPU in the long run.
> >>
> >> Anyway, there is only one way to check it out :)
> >>
> >> Alexey, is it possible to compare the compute duration for 'dhrystone'
> >> with these voltage OPP and without ? (with a period of cool down
> >> between
> >> the test in order to start at the same thermal condition) ?
> >
> > Sure, let me try that - would be interesting to see the results. In my
> > previous tinkering there were cases when the system stayed at 2.35GHz
> > for all big cores for non-trivial time (using the step-wise thermal
> > governor), and that's an example of "same voltage, lower frequency".
> > Other times though it throttled one cluster down to 1.8GHz and kept
> > the other at 2.4GHz, and was also stationary at those parameters for
> > extended time. This probably indicates that both of those states use
> > sustainable power in my cooling setup.
>
> IMHO, there are simply too many factors at play, including different
> possible cooling setups, so providing additional CPU throttling
> granularity can only be helpful. Of course, testing and recording
> data is the way to move forward, but I think we should use a few
> different tests.

So, benchmarking these turned out to be a bit trickier than I had hoped.
Apparently, dhrystone uses an unsigned int rather than an unsigned
long for the loop count (or something of that sort), so the counter
wraps around and I can't get it to run enough loops to heat up my chip
from a stable idle state to the throttling state. I ended up with a
couple of workarounds, namely:
- run dhrystone continuously on 6 out of 8 cores to make the chip
warm enough (`taskset -c 0-5 ./dhrystone -t 6 -r 6000` - note that on
my machine cores 6-7 are usually the first ones to get throttled, due
to whatever thermal peculiarities)
- wait for the temperature to stabilize (which happens at 79.5C)
- then run timed dhrystone on the remaining 2 out of 8 cores (big
ones) to see how throttling with different OPP tables affects overall
performance.
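
The "wait for the temperature to stabilize" step can be automated
along these lines. This is only a sketch, not what was used for the
runs in this mail: the sysfs path (thermal_zone0), the window size and
the 0.5C tolerance are all assumptions, and both helper names are
hypothetical.

```python
import time
from collections import deque


def read_sysfs_temp_c(path="/sys/class/thermal/thermal_zone0/temp"):
    """Read a thermal zone temperature in degrees C.

    The zone index is an assumption; the kernel reports millidegrees.
    """
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def wait_until_stable(read_temp, window=10, tolerance_c=0.5,
                      interval_s=1.0, sleep=time.sleep):
    """Poll read_temp() until the last `window` readings differ by at
    most `tolerance_c`, then return their average."""
    recent = deque(maxlen=window)
    while True:
        recent.append(read_temp())
        if len(recent) == window and max(recent) - min(recent) <= tolerance_c:
            return sum(recent) / window
        sleep(interval_s)
```

On a board, calling wait_until_stable(read_sysfs_temp_c) before
starting the timed run would give each run the same thermal starting
point.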

In the end, here's what I got with the 'original' OPP table (including
"same voltage - different frequencies" states):
alchark@rock-5b ~ $ taskset -c 6-7 ./dhrystone -t 2 -l 4000000000
duration: 0 seconds
number of threads: 2
number of loops: 4000000000000000
delay between starting threads: 0 seconds

Dhrystone(1.1) time for 1233977344 passes = 29.7
This machine benchmarks at 41481539 dhrystones/second
23609 DMIPS
Dhrystone(1.1) time for 1233977344 passes = 29.8
This machine benchmarks at 41476618 dhrystones/second
23606 DMIPS

Total dhrystone run time: 30.864492 seconds.
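
The pass count in the output above is consistent with the wraparound
theory. A minimal sketch, assuming (not confirmed from the dhrystone
source) that the printed loop count is truncated to 32 bits and then
split evenly between the threads; the helper name is hypothetical:

```python
def wrapped_passes_per_thread(requested_loops, threads, counter_bits=32):
    """Model an unsigned counter of `counter_bits` bits wrapping around,
    with the remaining loops divided evenly between threads."""
    wrapped = requested_loops % (1 << counter_bits)
    return wrapped // threads


# "number of loops" as printed by dhrystone, and "-t 2" from the command line
print(wrapped_passes_per_thread(4000000000000000, 2))  # prints 1233977344
```

That matches the "1233977344 passes" reported per thread.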

And here's what I got with the 'reduced' OPP table (keeping only the
highest frequency state for each voltage):
alchark@rock-5b ~ $ taskset -c 6-7 ./dhrystone -t 2 -l 4000000000
duration: 0 seconds
number of threads: 2
number of loops: 4000000000000000
delay between starting threads: 0 seconds

Dhrystone(1.1) time for 1233977344 passes = 30.9
This machine benchmarks at 39968549 dhrystones/second
22748 DMIPS
Dhrystone(1.1) time for 1233977344 passes = 31.0
This machine benchmarks at 39817431 dhrystones/second
22662 DMIPS

Total dhrystone run time: 31.995136 seconds.

Bottom line: removing the lower-frequency OPPs led to a 3.8% drop in
combined dhrystones/second in this setup (about 3.7% longer total run
time). This is probably far from a reliable estimate, but I guess it
does indicate that having lower-frequency states might be beneficial
in some load scenarios.
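
For reference, the percentage can be reproduced from the two outputs
above. A small sketch; the variable and function names are made up for
illustration, the numbers are copied from the runs:

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# dhrystones/second per thread and total run time, from the two runs
original = {"dps": [41481539, 41476618], "runtime": 30.864492}
reduced = {"dps": [39968549, 39817431], "runtime": 31.995136}

throughput_change = percent_change(sum(original["dps"]), sum(reduced["dps"]))
runtime_change = percent_change(original["runtime"], reduced["runtime"])

print(f"combined throughput: {throughput_change:+.1f}%")  # -3.8%
print(f"total run time:      {runtime_change:+.1f}%")     # +3.7%
```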

Note though that several seconds after hitting the throttling
threshold, cores 6-7 were oscillating between 1.608GHz and 1.8GHz in
both runs. This implies that the whole difference in performance came
from how quickly the initial throttling happened (i.e. it might be a
peculiarity of how the step-wise thermal governor operates when it has
to go through more cooling states to reach the "steady-state" one).
Since neither 1.608GHz nor 1.8GHz has a lower-frequency same-voltage
sibling in either of the OPP tables, under prolonged constant load
there should be no performance difference at all.

Best regards,
Alexey