Re: [PATCH v3 0/5] RK3588 and Rock 5B dts additions: thermal, OPP and fan

From: Alexey Charkov
Date: Tue Mar 05 2024 - 03:07:12 EST


Hi Sebastian!

On Mon, Mar 4, 2024 at 9:51 PM Sebastian Reichel
<sebastian.reichel@xxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On Thu, Feb 29, 2024 at 11:26:31PM +0400, Alexey Charkov wrote:
> > This enables thermal monitoring and CPU DVFS on RK3588(s), as well as
> > active cooling on Radxa Rock 5B via the provided PWM fan.
> >
> > Some RK3588 boards use separate regulators to supply the CPUs and
> > their respective memory interfaces, so this is handled by coupling
> > those regulators in the affected boards' device trees to ensure that
> > their voltages are adjusted in step.
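> >
> > For illustration, the coupling in a board .dts ends up looking
> > roughly like this (the regulator labels follow the EVB1 naming and
> > are just an example; the two properties are the standard regulator
> > coupling bindings):
> >
> >     &vdd_cpu_big0_s0 {
> >         regulator-coupled-with = <&vdd_cpu_big0_mem_s0>;
> >         regulator-coupled-max-spread = <10000>;
> >     };
> >
> >     &vdd_cpu_big0_mem_s0 {
> >         regulator-coupled-with = <&vdd_cpu_big0_s0>;
> >         regulator-coupled-max-spread = <10000>;
> >     };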
> >
> > In this revision of the series I chose to enable the TSADC for all
> > boards at the .dtsi level, because:
> > - The defaults already in the .dtsi should work for all users, given
> > that the CRU-based resets don't need any out-of-chip components, and
> > the CRU vs. PMIC reset is pretty much the only thing a board might
> > have to configure / override there
> > - Boards that have the TSADC_SHUT signal wired to the PMIC reset line
> > can still override the reset logic in their .dts (see the sketch
> > below), or stay with CRU-based resets, as downstream kernels do anyway
> > - The on-by-default approach helps ensure thermal protections are in
> > place (emergency reset and throttling) for any board, even one with a
> > rudimentary .dts, and thus lets us introduce CPU DVFS with better
> > peace of mind
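> >
> > As a sketch, such a board-level override could look like this (the
> > hw-tshut properties come from the rockchip-thermal binding; the
> > values depend on how the board wires TSADC_SHUT to the PMIC):
> >
> >     &tsadc {
> >         /* 0: CRU reset (the .dtsi default), 1: GPIO/PMIC reset */
> >         rockchip,hw-tshut-mode = <1>;
> >         /* 0: active low, 1: active high */
> >         rockchip,hw-tshut-polarity = <0>;
> >     };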
> >
> > Fan control on Rock 5B has been split into two intervals: the fan
> > spins at the minimum cooling state between 55C and 65C, and then
> > accelerates once the system crosses the 65C mark - thanks to Dragan
> > for the suggestion. This lets cooling setups with beefier heatsinks
> > and/or larger fans stay in the quietest non-zero fan state while
> > still gaining the benefits of the airflow it generates, possibly
> > avoiding noisy speeds altogether for some workloads.
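> >
> > Concretely, the two intervals translate into two active trip points
> > with one cooling map each, roughly like this (trip names and the
> > &fan phandle are illustrative; THERMAL_NO_LIMIT comes from
> > dt-bindings/thermal/thermal.h):
> >
> >     trips {
> >         package_fan0: package-fan0 {
> >             temperature = <55000>;
> >             hysteresis = <2000>;
> >             type = "active";
> >         };
> >
> >         package_fan1: package-fan1 {
> >             temperature = <65000>;
> >             hysteresis = <2000>;
> >             type = "active";
> >         };
> >     };
> >
> >     cooling-maps {
> >         /* 55C..65C: cap the fan at its lowest non-zero state */
> >         map0 {
> >             trip = <&package_fan0>;
> >             cooling-device = <&fan THERMAL_NO_LIMIT 1>;
> >         };
> >
> >         /* above 65C: let the fan accelerate from state 2 upwards */
> >         map1 {
> >             trip = <&package_fan1>;
> >             cooling-device = <&fan 2 THERMAL_NO_LIMIT>;
> >         };
> >     };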
> >
> > OPPs are what actually scale CPU frequencies up and down, for both
> > cooling and performance - tested on Rock 5B under varied loads. I've
> > split the patch into two parts: the first contains the OPPs that
> > reached general consensus as no-regret during the v1 review [2],
> > while the second contains OPPs that reduce the frequency without an
> > accompanying decrease in CPU voltage. There seems to be a slight
> > performance gain in some workload scenarios when using these, but
> > previous discussion was inconclusive as to whether they should be
> > included or not. Having them as separate patches enables easier
> > comparison and partial reversion if people want to test them under
> > their workloads, and also lets the first 'no-regret' part be merged
> > to -next while the jury is still out on the second one.
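> >
> > To illustrate the split: the second patch adds OPPs that lower the
> > frequency while keeping the same voltage as a faster neighbouring
> > OPP, i.e. pairs of this shape (the numbers are only illustrative of
> > the pattern, not the exact table values):
> >
> >     opp-600000000 {
> >         opp-hz = /bits/ 64 <600000000>;
> >         opp-microvolt = <675000 675000 950000>;
> >         clock-latency-ns = <40000>;
> >     };
> >
> >     opp-816000000 {
> >         opp-hz = /bits/ 64 <816000000>;
> >         /* same voltage as the 600 MHz entry above */
> >         opp-microvolt = <675000 675000 950000>;
> >         clock-latency-ns = <40000>;
> >     };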
> >
> > [1] https://lore.kernel.org/linux-rockchip/1824717.EqSB1tO5pr@bagend/T/#ma2ab949da2235a8e759eab22155fb2bc397d8aea
> > [2] https://lore.kernel.org/linux-rockchip/CABjd4YxqarUCbZ-a2XLe3TWJ-qjphGkyq=wDnctnEhdoSdPPpw@xxxxxxxxxxxxxx/T/#m49d2b94e773f5b532a0bb5d3d7664799ff28cc2c
> >
> > Signed-off-by: Alexey Charkov <alchark@xxxxxxxxx>
> > ---
> > Changes in v3:
> > - Added regulator coupling for EVB1 and QuartzPro64
> > - Enabled the TSADC for all boards in .dtsi, not just Rock 5B (thanks ChenYu)
> > - Added comments regarding two passive cooling trips in each zone (thanks Dragan)
> > - Fixed active cooling map numbering for Radxa Rock 5B (thanks Dragan)
> > - Dropped Daniel's Acked-by tag from the Rock 5B fan patch, as there's been quite some
> > churn there since the version he acknowledged
> > - Link to v2: https://lore.kernel.org/r/20240130-rk-dts-additions-v2-0-c6222c4c78df@xxxxxxxxx
> >
> > Changes in v2:
> > - Dropped the rfkill patch which Heiko has already applied
> > - Set higher 'polling-delay-passive' (100 instead of 20)
> > - Name all cooling maps starting from map0 in each respective zone
> > - Drop 'contribution' properties from passive cooling maps
> > - Link to v1: https://lore.kernel.org/r/20240125-rk-dts-additions-v1-0-5879275db36f@xxxxxxxxx
> >
> > ---
> > Alexey Charkov (5):
> > arm64: dts: rockchip: enable built-in thermal monitoring on RK3588
> > arm64: dts: rockchip: enable automatic active cooling on Rock 5B
> > arm64: dts: rockchip: Add CPU/memory regulator coupling for RK3588
> > arm64: dts: rockchip: Add OPP data for CPU cores on RK3588
> > arm64: dts: rockchip: Add further granularity in RK3588 CPU OPPs
> >
> > arch/arm64/boot/dts/rockchip/rk3588-evb1-v10.dts | 12 +
> > .../arm64/boot/dts/rockchip/rk3588-quartzpro64.dts | 12 +
> > arch/arm64/boot/dts/rockchip/rk3588-rock-5b.dts | 30 +-
> > arch/arm64/boot/dts/rockchip/rk3588s.dtsi | 385 ++++++++++++++++++++-
> > 4 files changed, 437 insertions(+), 2 deletions(-)
>
> I'm too busy to do a detailed review of this series right now, but
> I pushed it to our CI and it results in a board reset at boot time:
>
> https://gitlab.collabora.com/hardware-enablement/rockchip-3588/linux/-/jobs/300950
>
> I also pushed just the first three patches (i.e. without OPP /
> cpufreq) and that boots fine:
>
> https://gitlab.collabora.com/hardware-enablement/rockchip-3588/linux/-/jobs/300953

Thank you for testing these! I've noticed in the boot log that the CI
machine uses some u-boot 2023.07 - is that a downstream one? Any
chance to compare it to 2023.11 or 2024.01 from your (Collabora)
integration tree?

I use 2023.11 from your integration tree, with a binary bl31, and I'm
not getting those resets even under prolonged heavy load (I rebuild
Chromium with 8 concurrent compilation jobs as the stress test -
that's 14 hours of heavy CPU, memory and IO use). It would be
interesting to understand whether it's just a 'lucky' SoC specimen on
my side, or whether there is some dark magic happening differently on
my machine vs. your CI machine.

I'm thinking that if your CI machine uses a downstream u-boot, it
might be leaving some extra hardware running (PVTM?) which might do
weird stuff when the TSADC/clocks/voltages get readjusted by the
generic cpufreq driver...

> Note that OPP / cpufreq works on the same boards in the CI when
> using the ugly-and-not-for-upstream cpufreq driver:
>
> https://gitlab.collabora.com/hardware-enablement/rockchip-3588/linux/-/commit/9c90c5032743a0419bf3fd2f914a24fd53101acd
>
> My best guess right now is that this is related to the generic
> driver obviously not updating the GRF read margin registers.

If it were about memory read margins, I believe I would have been
unlikely to get my machine working reliably under heavy load with the
default ones, but who knows...

Best regards,
Alexey