Re: [PATCH v3 1/5] arm64: dts: rockchip: enable built-in thermal monitoring on RK3588

From: Alexey Charkov
Date: Fri Mar 01 2024 - 00:13:12 EST


Hi Dragan,

On Fri, Mar 1, 2024 at 12:21 AM Dragan Simic <dsimic@xxxxxxxxxxx> wrote:
>
> Hello Alexey,
>
> On 2024-02-29 20:26, Alexey Charkov wrote:
> > Include thermal zones information in device tree for RK3588 variants.
> >
> > This also enables the TSADC controller unconditionally on all boards
> > to ensure that thermal protections are in place via throttling and
> > emergency reset, once OPPs are added to enable CPU DVFS.
> >
> > The default settings (using CRU as the emergency reset mechanism)
> > should work on all boards regardless of their wiring, as CRU resets
> > do not depend on any external components. Boards that have the TSHUT
> > signal wired to the reset line of the PMIC may opt to switch to GPIO
> > tshut mode instead (rockchip,hw-tshut-mode = <1>;)
>
> Quite frankly, I'm still not sure that enabling this on the SoC level
> is the way to go. As I already described in detail, [4] according to
> the RK3588 Hardware Design Guide v1.0 and the Rock 5B schematic, we
> should actually use GPIO-based handling for the thermal runaways on
> the Rock 5B. Other boards should also be investigated individually,
> and the TSADC should be enabled on a board-to-board basis.

With all due respect, I disagree. Here is why:
- Neither the schematic nor the hardware design guide, on which the
schematic seems to be based, prescribes a particular way to handle
thermal runaways. They only provide the possibility of GPIO-based
resets, alongside the CRU-based one.
- My strong belief is that defaults (regardless of context) should be
safe and reasonable, and should also minimize the need to override
them
- In the context of dts/dtsi, as far as I understand the general logic
behind the split, the SoC .dtsi should contain all the things that are
fully contained within the SoC and do not depend on the wiring of a
particular board or its target use case. Boards then
add/remove/override settings to match their wiring and use case more
closely
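
As a rough sketch of that split (the board file name is made up and
the fragment is not taken from any tree), a board that really wanted
to opt out of an SoC-level default would only need a short override:

```dts
/* rk3588-example-board.dts: hypothetical board file, fragment only */
#include "rk3588s.dtsi"

/* Override the status that the SoC-level .dtsi sets for the TSADC */
&tsadc {
	status = "disabled";
};
```

A board that is happy with the SoC-level default does not need to
mention the node at all.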

In light of the last two points, I believe that enabling TSADC by
default is the safer and more reasonable choice, because it provides
crucial thermal protection logic for the SoC, and it can do so in a
board-agnostic way (if the CRU-based reset is selected, which is the
current default).

Furthermore, TSADC and CRU are fully contained within the SoC, and I
cannot think of a use case where a board might be somehow
disadvantaged by TSADC being enabled, and thus need to disable it
altogether (maybe I'm missing something). The only thing that a
board might need to adjust is the thermal reset handling, and even then
switching away from CRU to GPIO resets where the wiring permits it is
a matter of choice/preference rather than an existential need.
I presume that a PMIC-assisted reset causes deeper power cycling
of the SoC and might therefore help in some rare cases where the CRU
reset alone is not enough, but that would be niche.
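
For illustration only, such a board-level adjustment could look
roughly like the fragment below. The property names come from the
Rockchip thermal binding, but the polarity value is a placeholder
that would have to be checked against the board's schematic:

```dts
/* Hypothetical board .dts fragment; values depend on the wiring */
&tsadc {
	/* 0 = CRU reset (the SoC-level default), 1 = GPIO (TSHUT pin) */
	rockchip,hw-tshut-mode = <1>;
	/* 0 = active low, 1 = active high; placeholder value */
	rockchip,hw-tshut-polarity = <1>;
};
```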

All in all, I believe that the default of "fry my board if I have
no heatsink and forget to include &tsadc { status = "okay"; }; in my
dts" is substantially inferior to the default of "my board could do a
deep power-cycle in this weird corner-case thermal-runaway situation
that somehow didn't get handled by active cooling, then by passive
cooling, then by a CRU reset, but I didn't include
rockchip,hw-tshut-mode = <1>; so tough luck for me".

Would be great to hear other perspectives from people on the list.

Best regards,
Alexey

> [4]
> https://lore.kernel.org/linux-rockchip/4e7c2b5a938bd7c919b852699c951701@xxxxxxxxxxx/
>
> > It seems though that downstream kernels don't use that, even for
> > those boards where the wiring allows for GPIO based tshut, such as
> > Radxa Rock 5B [1], [2], [3]
> >
> > [1]
> > https://github.com/radxa/kernel/blob/stable-5.10-rock5/arch/arm64/boot/dts/rockchip/rk3588-rock-5b.dts#L540
> > [2]
> > https://github.com/radxa/kernel/blob/stable-5.10-rock5/arch/arm64/boot/dts/rockchip/rk3588s.dtsi#L5433
> > [3] https://dl.radxa.com/rock5/5b/docs/hw/radxa_rock_5b_v1423_sch.pdf
> > page 11 (TSADC_SHUT_H)
> >
> > Signed-off-by: Alexey Charkov <alchark@xxxxxxxxx>
> > ---
> > arch/arm64/boot/dts/rockchip/rk3588s.dtsi | 176
> > +++++++++++++++++++++++++++++-
> > 1 file changed, 175 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> > b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> > index 36b1b7acfe6a..9bf197358642 100644
> > --- a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> > +++ b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> > @@ -10,6 +10,7 @@
> > #include <dt-bindings/reset/rockchip,rk3588-cru.h>
> > #include <dt-bindings/phy/phy.h>
> > #include <dt-bindings/ata/ahci.h>
> > +#include <dt-bindings/thermal/thermal.h>
> >
> > / {
> > compatible = "rockchip,rk3588";
> > @@ -2225,7 +2226,180 @@ tsadc: tsadc@fec00000 {
> > pinctrl-1 = <&tsadc_shut>;
> > pinctrl-names = "gpio", "otpout";
> > #thermal-sensor-cells = <1>;
> > - status = "disabled";
> > + status = "okay";
> > + };
> > +
> > + thermal_zones: thermal-zones {
> > + /* sensor near the center of the SoC */
> > + package_thermal: package-thermal {
> > + polling-delay-passive = <0>;
> > + polling-delay = <0>;
> > + thermal-sensors = <&tsadc 0>;
> > +
> > + trips {
> > + package_crit: package-crit {
> > + temperature = <115000>;
> > + hysteresis = <0>;
> > + type = "critical";
> > + };
> > + };
> > + };
> > +
> > + /* sensor between A76 cores 0 and 1 */
> > + bigcore0_thermal: bigcore0-thermal {
> > + polling-delay-passive = <100>;
> > + polling-delay = <0>;
> > + thermal-sensors = <&tsadc 1>;
> > +
> > + trips {
> > + /* threshold to start collecting temperature
> > + * statistics e.g. with the IPA governor
> > + */
> > + bigcore0_alert0: bigcore0-alert0 {
> > + temperature = <75000>;
> > + hysteresis = <2000>;
> > + type = "passive";
> > + };
> > + /* actual control temperature */
> > + bigcore0_alert1: bigcore0-alert1 {
> > + temperature = <85000>;
> > + hysteresis = <2000>;
> > + type = "passive";
> > + };
> > + bigcore0_crit: bigcore0-crit {
> > + temperature = <115000>;
> > + hysteresis = <0>;
> > + type = "critical";
> > + };
> > + };
> > + cooling-maps {
> > + map0 {
> > + trip = <&bigcore0_alert1>;
> > + cooling-device =
> > + <&cpu_b0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
> > + <&cpu_b1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> > + };
> > + };
> > + };
> > +
> > + /* sensor between A76 cores 2 and 3 */
> > + bigcore2_thermal: bigcore2-thermal {
> > + polling-delay-passive = <100>;
> > + polling-delay = <0>;
> > + thermal-sensors = <&tsadc 2>;
> > +
> > + trips {
> > + /* threshold to start collecting temperature
> > + * statistics e.g. with the IPA governor
> > + */
> > + bigcore2_alert0: bigcore2-alert0 {
> > + temperature = <75000>;
> > + hysteresis = <2000>;
> > + type = "passive";
> > + };
> > + /* actual control temperature */
> > + bigcore2_alert1: bigcore2-alert1 {
> > + temperature = <85000>;
> > + hysteresis = <2000>;
> > + type = "passive";
> > + };
> > + bigcore2_crit: bigcore2-crit {
> > + temperature = <115000>;
> > + hysteresis = <0>;
> > + type = "critical";
> > + };
> > + };
> > + cooling-maps {
> > + map0 {
> > + trip = <&bigcore2_alert1>;
> > + cooling-device =
> > + <&cpu_b2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
> > + <&cpu_b3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> > + };
> > + };
> > + };
> > +
> > + /* sensor between the four A55 cores */
> > + little_core_thermal: littlecore-thermal {
> > + polling-delay-passive = <100>;
> > + polling-delay = <0>;
> > + thermal-sensors = <&tsadc 3>;
> > +
> > + trips {
> > + /* threshold to start collecting temperature
> > + * statistics e.g. with the IPA governor
> > + */
> > + littlecore_alert0: littlecore-alert0 {
> > + temperature = <75000>;
> > + hysteresis = <2000>;
> > + type = "passive";
> > + };
> > + /* actual control temperature */
> > + littlecore_alert1: littlecore-alert1 {
> > + temperature = <85000>;
> > + hysteresis = <2000>;
> > + type = "passive";
> > + };
> > + littlecore_crit: littlecore-crit {
> > + temperature = <115000>;
> > + hysteresis = <0>;
> > + type = "critical";
> > + };
> > + };
> > + cooling-maps {
> > + map0 {
> > + trip = <&littlecore_alert1>;
> > + cooling-device =
> > + <&cpu_l0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
> > + <&cpu_l1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
> > + <&cpu_l2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
> > + <&cpu_l3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> > + };
> > + };
> > + };
> > +
> > + /* sensor near the PD_CENTER power domain */
> > + center_thermal: center-thermal {
> > + polling-delay-passive = <0>;
> > + polling-delay = <0>;
> > + thermal-sensors = <&tsadc 4>;
> > +
> > + trips {
> > + center_crit: center-crit {
> > + temperature = <115000>;
> > + hysteresis = <0>;
> > + type = "critical";
> > + };
> > + };
> > + };
> > +
> > + gpu_thermal: gpu-thermal {
> > + polling-delay-passive = <0>;
> > + polling-delay = <0>;
> > + thermal-sensors = <&tsadc 5>;
> > +
> > + trips {
> > + gpu_crit: gpu-crit {
> > + temperature = <115000>;
> > + hysteresis = <0>;
> > + type = "critical";
> > + };
> > + };
> > + };
> > +
> > + npu_thermal: npu-thermal {
> > + polling-delay-passive = <0>;
> > + polling-delay = <0>;
> > + thermal-sensors = <&tsadc 6>;
> > +
> > + trips {
> > + npu_crit: npu-crit {
> > + temperature = <115000>;
> > + hysteresis = <0>;
> > + type = "critical";
> > + };
> > + };
> > + };
> > };
> >
> > saradc: adc@fec10000 {