Re: [PATCH] thermal: core: Check correct temperature for thermal trip notification

From: Nícolas F. R. A. Prado
Date: Mon Sep 25 2023 - 12:20:51 EST


On Mon, Sep 25, 2023 at 12:19:18PM +0200, Daniel Lezcano wrote:
>
> Hi Nicolas,
>
> On 22/09/2023 21:27, Nícolas F. R. A. Prado wrote:
> > The thermal trip down notification should be triggered when the
> > temperature goes below the trip temperature minus the hysteresis. But
> > while the temperature is correctly checked against that, the last
> > temperature is instead compared against the trip point temperature. The
> > end result is that the notification won't always be triggered, only when
> > the temperature happens to drop quick enough so that the last
> > temperature was still above the trip point temperature.
> >
> > Fix the incorrect check.
> >
> > Fixes: 55cdf0a283b8 ("thermal: core: Add notifications call in the framework")
> > Signed-off-by: Nícolas F. R. A. Prado <nfraprado@xxxxxxxxxxxxx>
> >
> > ---
> >
> > drivers/thermal/thermal_core.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> > index 58533ea75cd9..120fcf23b8e5 100644
> > --- a/drivers/thermal/thermal_core.c
> > +++ b/drivers/thermal/thermal_core.c
> > @@ -361,7 +361,7 @@ static void handle_thermal_trip(struct thermal_zone_device *tz, int trip_id)
> > tz->temperature >= trip.temperature)
> > thermal_notify_tz_trip_up(tz->id, trip_id,
> > tz->temperature);
> > - if (tz->last_temperature >= trip.temperature &&
> > + if (tz->last_temperature >= (trip.temperature - trip.hysteresis) &&
> > tz->temperature < (trip.temperature - trip.hysteresis))
> > thermal_notify_tz_trip_down(tz->id, trip_id,
> > tz->temperature);
>
> We already did a try to fix the thermal trip cross notification but this is
> not sufficient for a full fix.
>
> We are receiving multiple notifications from the same event, all this due to
> the hysteresis.
>
> Let's say, we have a trip point T and a hysteresis H.
>
> There is a trip point crossed the way up when:
>
> last_temperature < T and temperature >= T
>
> At this point, we send a notification
>
> Now, the temperature decreases but it stays in the hysteresis:
>
> last_temperature >= T and temperature > (T - H)
>
> And then, the temperature increases again and is greater than T.
>
> last_temperature > (T - H) and temperature >= T
>
> We send a second notification.

Right, I've observed this issue in the logic that updates the trip points, and
just sent the v2 fix for it:

https://lore.kernel.org/all/20230922184425.290894-1-nfraprado@xxxxxxxxxxxxx

I wasn't aware of the cleanups you pointed to below, but at least in their
current form it doesn't seem they would take care of fixing the trip point
update logic like I did in that patch. So please do take a look.

It was while testing that patch that I stumbled upon this code for the
notification and just quickly fixed it. But indeed this patch is not a full fix
as the one you pointed to, so let's wait for that one.

Thanks,
Nícolas

>
> With the mitigation kicking in at temp >= T, we end up with multiple events
> 'temperature crossed the trip point the way up"'. That forces the receiver
> of the events to detect duplicate events in order to ignore them.
>
> More info:
>
> https://lore.kernel.org/all/20220718145038.1114379-4-daniel.lezcano@xxxxxxxxxx/
>
> We have done a lot of cleanups to use the 'generic trip points' and remove
> those get_trip_* ops. So the trip point structure is a core component being
> reused by the drivers and registered as an array.
>
> Next step is to make sure the trip points are ordered in the array, so we
> can implement the correct trip point crossing detection.
>
>
>
>
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>