Re: [PATCH 4/4] arm64: dts: rockchip: Add OPP data for CPU cores on RK3588

From: Alexey Charkov
Date: Sun Jan 28 2024 - 14:33:36 EST


On Sun, Jan 28, 2024 at 7:06 PM Daniel Lezcano
<daniel.lezcano@xxxxxxxxxx> wrote:
>
>
> Hi Alexey,

Hi Daniel,

> On 27/01/2024 20:41, Alexey Charkov wrote:
> > On Sat, Jan 27, 2024 at 12:33 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
> >>
> >> On 2024-01-26 14:44, Alexey Charkov wrote:
> >>> On Fri, Jan 26, 2024 at 4:56 PM Daniel Lezcano
> >>> <daniel.lezcano@xxxxxxxxxx> wrote:
> >>>> On 26/01/2024 08:49, Dragan Simic wrote:
> >>>>> On 2024-01-26 08:30, Alexey Charkov wrote:
> >>>>>> On Fri, Jan 26, 2024 at 11:05 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
> >>>>>>> On 2024-01-26 07:44, Alexey Charkov wrote:
> >>>>>>>> On Fri, Jan 26, 2024 at 10:32 AM Dragan Simic <dsimic@xxxxxxxxxxx>
> >>>>>>>> wrote:
> >>>>>>>>> On 2024-01-25 10:30, Daniel Lezcano wrote:
> >>>>>>>>>> On 24/01/2024 21:30, Alexey Charkov wrote:
> >>>>>>>>>>> By default the CPUs on RK3588 start up in a conservative performance
> >>>>>>>>>>> mode. Add frequency and voltage mappings to the device tree to enable
> >>>>
> >>>> [ ... ]
> >>>>
> >>>>>> Throttling would also lower the voltage at some point, which cools it
> >>>>>> down much faster!
> >>>>>
> >>>>> Of course, but the key is not to cool (and slow down) the CPU cores too
> >>>>> much, but just enough to stay within the available thermal envelope,
> >>>>> which is where the same-voltage, lower-frequency OPPs should shine.
> >>>>
> >>>> That implies the resulting power is sustainable, which I doubt is the
> >>>> case.
> >>>>
> >>>> It is the voltage scaling that makes the cooling effective, not the
> >>>> frequency.
> >>>>
> >>>> For example:
> >>>> opp5 = opp(2GHz, 1V) => 2 BogoWatt
> >>>> opp4 = opp(1.9GHz, 1V) => 1.9 BogoWatt
> >>>> opp3 = opp(1.8GHz, 0.9V) => 1.458 BogoWatt
> >>>> [ other states but we focus on these 3 ]
> >>>>
> >>>> opp5->opp4 => -5% compute capacity, -5% power, ratio=1
> >>>> opp4->opp3 => -5% compute capacity, -23.1% power, ratio=21.6
> >>>>
> >>>> opp5->opp3 => -10% compute capacity, -27.1% power, ratio=36.9
> >>>>
> >>>> In burst operation (no thermal throttling), opp4 is pointless we agree
> >>>> on that.
> >>>>
> >>>> IMO the following will happen: in burst operation with thermal
> >>>> throttling we hit the trip point and then the step-wise governor
> >>>> reduces opp5 -> opp4. We get a slight power reduction but the
> >>>> temperature does not decrease, so at the next iteration it is
> >>>> throttled to opp3. In the end we have opp4 <-> opp3 back and forth
> >>>> instead of opp5 <-> opp3.
> >>>>
> >>>> It is probable we end up with an equivalent frequency average (or
> >>>> compute capacity avg).
> >>>>
> >>>> opp4 <-> opp3 (longer duration in states, less transitions)
> >>>> opp5 <-> opp3 (shorter duration in states, more transitions)
> >>>>
> >>>> Some platforms had their higher OPPs with the same voltage and they
> >>>> failed to cool down the CPU in the long run.
> >>>>
> >>>> Anyway, there is only one way to check it out :)
> >>>>
> >>>> Alexey, is it possible to compare the compute duration for 'dhrystone'
> >>>> with these same-voltage OPPs and without? (With a cool-down period
> >>>> between the tests in order to start from the same thermal condition.)
> >>>
> >>> Sure, let me try that - would be interesting to see the results. In my
> >>> previous tinkering there were cases when the system stayed at 2.35GHz
> >>> for all big cores for non-trivial time (using the step-wise thermal
> >>> governor), and that's an example of "same voltage, lower frequency".
> >>> Other times though it throttled one cluster down to 1.8GHz and kept
> >>> the other at 2.4GHz, and was also stationary at those parameters for
> >>> extended time. This probably indicates that both of those states use
> >>> sustainable power in my cooling setup.
> >>
> >> IMHO, there are simply too many factors at play, including different
> >> possible cooling setups, so providing additional CPU throttling
> >> granularity can only be helpful. Of course, testing and recording
> >> data is the way to move forward, but I think we should use a few
> >> different tests.
> >
> > Soooo, benchmarking these turned out a bit trickier than I had hoped
> > for. Apparently, dhrystone uses an unsigned int rather than an
> > unsigned long for the loops count (or something of that sort), which
> > means that I can't get it to run enough loops to heat up my chip from
> > a stable idle state to the throttling state (due to counter
> > wraparound). So I ended up with a couple of crutches, namely:
> > - run dhrystone continuously on 6 out of 8 cores to make the chip
> > warm enough (`taskset -c 0-5 ./dhrystone -t 6 -r 6000` - note that on
> > my machine cores 6-7 are usually the first ones to get throttled, due
> > to whatever thermal peculiarities)
> > - wait for the temperature to stabilize (which happens at 79.5C)
> > - then run timed dhrystone on the remaining 2 out of 8 cores (big
> > ones) to see how throttling with different OPP tables affects overall
> > performance.
>
> Thanks for taking the time to test.
>
> > In the end, here's what I got with the 'original' OPP table (including
> > "same voltage - different frequencies" states):
> > alchark@rock-5b ~ $ taskset -c 6-7 ./dhrystone -t 2 -l 4000000000
> > duration: 0 seconds
> > number of threads: 2
> > number of loops: 4000000000000000
> > delay between starting threads: 0 seconds
> >
> > Dhrystone(1.1) time for 1233977344 passes = 29.7
> > This machine benchmarks at 41481539 dhrystones/second
> > 23609 DMIPS
> > Dhrystone(1.1) time for 1233977344 passes = 29.8
> > This machine benchmarks at 41476618 dhrystones/second
> > 23606 DMIPS
> >
> > Total dhrystone run time: 30.864492 seconds.
> >
> > And here's what I got with the 'reduced' OPP table (keeping only the
> > highest frequency state for each voltage):
> > alchark@rock-5b ~ $ taskset -c 6-7 ./dhrystone -t 2 -l 4000000000
> > duration: 0 seconds
> > number of threads: 2
> > number of loops: 4000000000000000
> > delay between starting threads: 0 seconds
> >
> > Dhrystone(1.1) time for 1233977344 passes = 30.9
> > This machine benchmarks at 39968549 dhrystones/second
> > 22748 DMIPS
> > Dhrystone(1.1) time for 1233977344 passes = 31.0
> > This machine benchmarks at 39817431 dhrystones/second
> > 22662 DMIPS
> >
> > Total dhrystone run time: 31.995136 seconds.
> >
> > Bottom line: removing the lower-frequency OPPs led to a 3.8% drop in
> > performance in this setup. This is probably far from a reliable
> > estimate, but I guess it indeed indicates that having lower-frequency
> > states might be beneficial in some load scenarios.
>
> What is the duration between these two tests?

Several hours and a couple of reboots. I did the first one, recorded
the results and the temperatures, then rebuilt the dtb the next day,
rebooted with it and did everything again with the other OPP table.

> I would be curious if it is repeatable by inverting the setup (reduced
> OPP table and then original OPP table).

Frankly, I can't see how ordering could have mattered, given that I
let the system cool down completely, and also rebooted it to use a
different dtb, so there shouldn't have been any caching effects. Maybe
there is some outside randomness in the results though - perhaps 5-10
repetitions in each case would have been more statistically
meaningful. But then again to make it statistically meaningful I'd
have to peg the other (non-benchmarked) cores to a static OPP to
ensure the thermal governor doesn't play with them when not asked to -
and it all starts to sound like a rabbit hole :)
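
For reference, the difference between the two quoted runs can be
recomputed from the logs. This is only a minimal Python sketch: the helper
name is made up for illustration, and the inputs are the per-thread
dhrystones/second and the total run times copied from the two runs quoted
above.

```python
def relative_change(baseline, candidate):
    """Fractional change of candidate relative to baseline."""
    return candidate / baseline - 1.0

# Per-thread dhrystones/second, copied from the two logs quoted above.
original_rate = 41481539 + 41476618   # 'original' OPP table
reduced_rate = 39968549 + 39817431    # 'reduced' OPP table

# Total run times in seconds, also taken from the logs.
original_time = 30.864492
reduced_time = 31.995136

rate_change = relative_change(original_rate, reduced_rate)
time_change = relative_change(original_time, reduced_time)

print(f"throughput change: {rate_change:+.1%}")
print(f"run time change:   {time_change:+.1%}")
```

Both figures come out a little under 4%, but with a single run per OPP
table there is no way to tell how much of that is noise.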

> BTW: I used -l 10000 for a roughly 30-second workload on the
> rk3399; maybe -l 20000 will be ok for the rk3588.

-l 20000 with two threads also gives me about 30 seconds of runtime...
While -l 200000 completed in 25 seconds *facepalm*
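
This looks consistent with the counter wraparound mentioned above. Here is
a rough Python sketch of it. None of it is checked against the dhrystone
source, and it rests on three assumptions: that the -l value is scaled by
10^6 (as the "number of loops" line in the logs suggests), that the loops
are split evenly between the threads, and that each thread's count is
truncated to 32 bits. The function name is hypothetical.

```python
UINT32_MODULUS = 2 ** 32  # assumed width of the per-thread loop counter

def passes_per_thread(l_arg, threads, scale=1_000_000):
    """Loops each thread would actually run if its count wraps at 32 bits.

    `scale` and the even split across threads are assumptions inferred
    from the printed 'number of loops' line, not from the source code.
    """
    total_loops = l_arg * scale
    return (total_loops // threads) % UINT32_MODULUS

for l_arg in (20_000, 200_000, 4_000_000_000):
    print(l_arg, passes_per_thread(l_arg, threads=2))
```

Under those assumptions -l 4000000000 yields 1233977344 passes per thread,
which matches the logs, and -l 200000 ends up with fewer passes than
-l 20000, which would explain the shorter run.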

Best regards,
Alexey