Re: [PATCH 1/2] thermal: core: add initial support for cold and critical_cold trip point

From: Rafael J. Wysocki
Date: Wed Dec 13 2023 - 10:16:44 EST


On Wed, Dec 13, 2023 at 4:10 PM Daniel Lezcano
<daniel.lezcano@xxxxxxxxxx> wrote:
>
> On 13/12/2023 15:20, Christian Marangi wrote:
> > On Wed, Dec 13, 2023 at 03:12:41PM +0100, Daniel Lezcano wrote:
> >> On 12/12/2023 23:13, Christian Marangi wrote:
> >>> Add initial support for cold and critical_cold trip point. Many if not
> >>> all hwmon and thermal device have normally trip point for hot
> >>> temperature and for cold temperature.
> >>>
> >>> Till now only hot temperature were supported. Add support for also cold
> >>> temperature to permit complete definition of cold trip point in DT.
> >>>
> >>> Thermal driver may use these additional trip point to correctly set
> >>> interrupt for cold temperature values and react based on that with
> >>> various measure like enabling attached heater, forcing higher voltage
> >>> and other specialaized peripherals.
> >>>
> >>> For hwmon drivers this is needed as currently there is a problem with
> >>> setting the full operating range of the device for thermal devices
> >>> defined with hwmon. To better describe the problem, the following
> >>> example is needed:
> >>>
> >>> In the scenario of a simple hwmon with an active trip point declared
> >>> and a cooling device attached, the hwmon subsystem currently set the
> >>> min and max trip point based on the single active trip point.
> >>> Thermal subsystem parse all the trip points and calculate the lowest and
> >>> the highest trip point and calls the .set_trip of hwmon to setup the
> >>> trip points.
> >>>
> >>> The fact that we currently don't have a way to declare the cold/min
> >>> temperature values, makes the thermal subsystem to set the low value as
> >>> -INT_MAX.
> >>> For hwmon drivers that doesn't use clamp_value and actually reject
> >>> invalid values for the trip point, this results in the hwmon settings to
> >>> be rejected.
> >>>
> >>> To permit to pass the correct range of trip point, permit to set in DT
> >>> also cold and critical_cold trip point.
> >>>
> >>> Thermal driver may also define .cold and .critical_cold to act on these
> >>> trip point tripped and apply the required measure.
> >>
> >> Agree with the feature but we need to clarify the semantic of the trip
> >> points first. What actions do we expect for them in order to have like a
> >> mirror reflection of the existing hot trip points.
> >>
> >> What action do you expect with:
> >>
> >> - 'cold' ?
> >>
> >> - 'critical_cold' ?
> >>
> >>
> >
> > This is more of a sensible topic but I think it's the thermal driver
> > that needs to implement these. As said in the commit description,
> > examples are setting higher voltage from the attached regulator,
> > enabling some hardware heater.
>
> That is a warming device. In the thermal framework design it is part of
> the mitigation device actors like a cooling device. The driver does not
> have to deal with that.
>
> > Maybe with critical cold bigger measure can be applied. Currently for
> > critical trip point we shutdown the system (if the critical ops is not
> > declared) but with critical_cold condition I think it won't work... I
> > expect a system in -40°C would just lock up/glitch so rebooting in that
> > condition won't change a thing...
>
> From my POV, a critical trip point is for a too hot or too cold trip
> point. The temperature should be set before the system will be
> malfunctioning, so a halt (or reboot if set) should work.
>
> I'm not in favor of adding more callbacks if we can avoid them. Passing
> the trip point to the critical callback makes more sense to me.
>
> > Anyway yes we can define a shutdown by default for that but IMHO it
> > wouldn't make much sense.
>
> Why? If the device is about to go to out of the functioning range, then
> it makes sense to shut it down. IIRC, electric signals lose their
> stability below the lower bound temperature.
>
> There is the point about the mitigation to stay above a certain
> temperature. If the devices do not have any kind a 'warming' device,
> then an alternative would be to have a generic warming device mirroring
> the cooling device with a duty cycles at different performance states.
> With this case, we may need another trip point type for a governor which
> should handle that.
>
> So we end up with the question: shall we add trip point types or trip
> point properties?
>
> 1. Trip point types
>
> - active / passive : mitigate
> - hot : notify userspace
> - critical : halt by default
> - cold : do something
> - cold_crit : do something else with a callback
>
> 2. Trip point types + property
>
> - active / passive : mitigate
> - hot : cool down
> - cold : warm up
>
> - hot : notify userspace
> - cold : notify userspace
>
> - critical:
> - hot : shutdown (or callback + trip)
> - cold : shutdown (or callback + trip)
>
> That implies there are two models:
>
> 1. We assume cold / hot trip points are symmetric features of the
> thermal management. So we do mitigation using governors, if that
> mitigation fails then we end up with critical actions. A consistent
> behavior between temperature increase or decrease. From my POV, I prefer
> this approach because it reflects nicely the functioning range temperature.

I agree here, FWIW.

> 2. We handle the cold situation differently by doing a on/off action on
> a specific device. That is an asymmetric approach.