Re: [PATCH] arm64: dts: rockchip: enable built-in thermal monitoring on rk3588

From: Alexey Charkov
Date: Mon Jan 22 2024 - 01:03:51 EST


On Mon, Jan 22, 2024 at 8:55 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
>
> Hello Alexey,
>
> On 2024-01-21 19:56, Alexey Charkov wrote:
> > On Thu, Jan 18, 2024 at 10:48 PM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
> >> On 2024-01-08 14:41, Alexey Charkov wrote:
> >> I apologize for my delayed response. It took me almost a month to
> >> nearly fully recover from some really nasty flu that eventually went
> >> into my lungs. It was awful and I'm still not back to my 100%. :(
> >
> > Ouch, I hope you get well soon!
>
> Thank you, let's hope so. It's been really exhausting. :(
>
> >> > On Sun, Jan 7, 2024 at 2:54 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
> >> >> On 2024-01-06 23:23, Alexey Charkov wrote:
> >> >> > Include thermal zones information in device tree for rk3588 variants
> >> >> > and enable the built-in thermal sensing ADC on RADXA Rock 5B
> >> >> >
> >> >> > Signed-off-by: Alexey Charkov <alchark@xxxxxxxxx>
> >> >> > ---
> >> >> > diff --git a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> >> >> > b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> >> >> > index 8aa0499f9b03..8235991e3112 100644
> >> >> > --- a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> >> >> > +++ b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> >> >> > @@ -10,6 +10,7 @@
> >> >> > #include <dt-bindings/reset/rockchip,rk3588-cru.h>
> >> >> > #include <dt-bindings/phy/phy.h>
> >> >> > #include <dt-bindings/ata/ahci.h>
> >> >> > +#include <dt-bindings/thermal/thermal.h>
> >> >> >
> >> >> > / {
> >> >> > compatible = "rockchip,rk3588";
> >> >> > @@ -2112,6 +2113,148 @@ tsadc: tsadc@fec00000 {
> >> >> > status = "disabled";
> >> >> > };
> >> >> >
> >> >> > +	thermal_zones: thermal-zones {
> >> >> > +		soc_thermal: soc-thermal {
> >> >>
> >> >> It should be better to name it cpu_thermal instead. In the end,
> >> >> that's what it is.
> >> >
> >> > The TRM document says the first TSADC channel (to which this section
> >> > applies) measures the temperature near the center of the SoC die,
> >> > which implies not only the CPU but also the GPU at least. RADXA's
> >> > kernel for Rock 5B also has GPU passive cooling as one of the cooling
> >> > maps under this node (not included here, as we don't have the GPU node
> >> > in .dtsi just yet). So perhaps naming this one cpu_thermal could be
> >> > misleading?
> >>
> >> Ah, I see now, thanks for reminding; it's all described on page 1,372
> >> of the first part of the RK3588 TRM v1.0.
> >>
> >> Having that in mind, I'd suggest that we end up naming it
> >> package_thermal. The temperature near the center of the chip is
> >> usually considered to be the overall package temperature; for
> >> example, that's how the user-facing CPU temperatures are measured
> >> in the x86_64 world.
> >
> > Sounds good, will rename in v3!
>
> Thanks, I'm glad you agree.
>
> >> >> > +			trips {
> >> >> > +				threshold: trip-point-0 {
> >> >>
> >> >> It should be better to name it cpu_alert0 instead, because that's what
> >> >> other newer dtsi files already use.
> >> >
> >> > Reflecting on your comments here and below, I'm thinking that maybe it
> >> > would be better to define only the critical trip point for the SoC
> >> > overall, and then have alerts along with the respective cooling maps
> >> > separately for A76-0,1, A76-2,3, A55-0,1,2,3? After all, given that we
> >> > have more granular temperature measurement here than in previous RK
> >> > chipsets it might be better to only throttle the "offending" cores,
> >> > not the full package.
> >> >
> >> > What do you think?
> >> >
> >> > Downstream DT doesn't follow this approach though, so maybe there's
> >> > something I'm missing here.
> >>
> >> I agree, it's better to fully utilize the higher measurement
> >> granularity made possible by having multiple temperature sensors
> >> available.
> >>
> >> I also agree that we should have only the critical trip defined for
> >> the package-level temperature sensor. Let's have the separate
> >> temperature measurements for the CPU (sub)clusters do the thermal
> >> throttling, and let's keep the package-level measurement for the
> >> critical shutdowns only. IIRC, some MediaTek SoC dtsi already does
> >> exactly that.
> >>
> >> Of course, there are no reasons not to have the critical trips defined
> >> for the CPU (sub)clusters as well.
> >
> > I think I'll also add a board-specific active cooling mechanism on the
> > package level in the next iteration, given that Rock 5B has a PWM fan
> > defined as a cooling device. That will go in the separate patch that
> > updates rk3588-rock-5b.dts (your feedback to v2 of this patch is also
> > duly noted, thank you!)
>
> Great, thanks. Sure, making use of the Rock 5B's support for attaching
> a PWM-controlled cooling fan is the way to go.
>
> Just to reiterate a bit, any "active" trip points belong to the board
> dts file(s), because having a cooling fan is a board-specific feature.
> As a note, you may also want to have a look at the RockPro64 dts(i)
> files, for example; the RockPro64 also comes with a cooling fan
> connector and the associated PWM fan control logic.

Thanks for the pointer! There is also a helpful doc among the devicetree
bindings descriptions, although it sits under hwmon, which was a bit
confusing to me. I've already tested the fan as a cooling device locally
(by adding it to the board dts), and it spins up and down quite nicely,
and even modulates the fan speed swiftly when the load changes - yay!
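
For illustration only, a board-level addition along these lines could
look roughly like the sketch below. It is not the exact change tested
above: the package_thermal label (the name proposed for v3), the fan
label for the board's pwm-fan node, the trip name and the 55 degrees C
threshold are all assumptions. THERMAL_NO_LIMIT comes from
<dt-bindings/thermal/thermal.h>.

```dts
/* Hypothetical sketch for a board .dts; labels and values are assumed */
&package_thermal {
	trips {
		/* Start spinning the fan once the package warms up */
		package_fan0: package-fan0 {
			temperature = <55000>;	/* millidegrees Celsius, assumed */
			hysteresis = <2000>;
			type = "active";
		};
	};

	cooling-maps {
		map0 {
			trip = <&package_fan0>;
			/* Let the governor use every cooling state of the fan */
			cooling-device = <&fan THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
		};
	};
};
```

Keeping the "active" trip and its cooling map in the board file leaves
the SoC .dtsi free of anything that depends on a fan being present.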

> >> >> > +					temperature = <75000>;
> >> >> > +					hysteresis = <2000>;
> >> >> > +					type = "passive";
> >> >> > +				};
> >> >> > +				target: trip-point-1 {
> >> >>
> >> >> It should be better to name it cpu_alert1 instead, because that's what
> >> >> other newer dtsi files already use.
> >> >>
> >> >> > +					temperature = <85000>;
> >> >> > +					hysteresis = <2000>;
> >> >> > +					type = "passive";
> >> >> > +				};
> >> >> > +				soc_crit: soc-crit {
> >> >>
> >> >> It should be better to name it cpu_crit instead, because that's what
> >> >> other newer dtsi files already use.
> >> >
> >> > Seems to me that if I define separate trips for the three groups of
> >> > CPU cores as mentioned above this would better stay as soc_crit, as it
> >> > applies to the whole die rather than the CPU cluster alone. Then
> >> > 'threshold' and 'target' will go altogether, and I'll have separate
> >> > *_alert0 and *_alert1 per CPU group.
> >>
> >> It would perhaps be best to have "passive", "hot" and "critical"
> >> trips defined for all three CPU groups/(sub)clusters, separately of
> >> course, to have even higher granularity when it comes to the resulting
> >> thermal throttling.
> >
> > I looked through drivers/thermal/rockchip_thermal.c, and it doesn't
> > seem to provide any callback for the "hot" trip as part of its struct
> > thermal_zone_device_ops, so I guess it would be redundant in our case
> > here? I couldn't find any generic mechanism to react to "hot" trips,
> > and they seem to be purely driver-specific, thus a no-op in the case
> > of Rockchip - or am I missing something?
>
> That's a good question. Please, let me go through the code in detail,
> and I'll get back with an update soon. Also, please wait a bit with
> sending the v3, until all open questions are addressed.

Of course. Thank you for taking the time to dig through this one with me!

Best regards,
Alexey