Re: [PATCH] thermal/core: Emit a warning if the thermal zone is updated without ops

From: Lukasz Luba
Date: Tue Dec 08 2020 - 09:57:22 EST




On 12/8/20 2:37 PM, Lukasz Luba wrote:


On 12/8/20 1:51 PM, Daniel Lezcano wrote:

Hi Lukasz,

On 08/12/2020 10:36, Lukasz Luba wrote:
Hi Daniel,

[ ... ]

     static void thermal_zone_device_init(struct thermal_zone_device *tz)
@@ -553,11 +555,9 @@ void thermal_zone_device_update(struct
thermal_zone_device *tz,
       if (atomic_read(&in_suspend))
           return;
   -    if (!tz->ops->get_temp)
+    if (update_temperature(tz))
           return;
   -    update_temperature(tz);
-

I think the patch does a bit more. Previously we continued running the
code below even when the thermal_zone_get_temp() returned an error (due
to various reasons). Now we stop and probably would not schedule next
polling, not calling:
handle_thermal_trip() and monitor_thermal_zone()

I agree there is a change in the behavior.

I would left update_temperature(tz) as it was and not check the return.
The function thermal_zone_get_temp() can protect itself from missing
tz->ops->get_temp(), so we should be safe.

What do you think?

Does it make sense to handle the trip point if we are unable to read the
temperature?

The lines following the update_temperature() are:

  - thermal_zone_set_trips() which needs a correct tz->temperature

  - handle_thermal_trip() which needs a correct tz->temperature to
compare with

  - monitor_thermal_zone() which needs a consistent tz->passive. This one
is updated by the governor which is in an inconsistent state because the
temperature is not updated.

The problem I see here is how the interrupt mode and the polling mode
are existing in the same code path.

The interrupt mode can call thermal_notify_framework() for critical/hot
trip points without being followed by a monitoring. But for the other
trip points, the get_temp is needed.

Yes, I agree that we can bail out when there is no .get_temp() callback
and even not schedule next polling in such case.
But I am just not sure if we can bail out and not schedule the next
polling, when there is .get_temp() populated and the driver returned
an error only at that moment, e.g. indicating some internal temporary,
issue like send queue full, so such as -EBUSY, or -EAGAIN, etc.
The thermal_zone_get_temp() would pass the error to update_temperature()
but we return, losing the next try. We would not check the temperature
again.


Some links to point you to such code where sensor has to deal
with protocol and various error reasons [1][2][3]


[1] https://elixir.bootlin.com/linux/latest/source/drivers/firmware/arm_scmi/sensors.c#L230
[2] https://elixir.bootlin.com/linux/latest/source/drivers/firmware/arm_scmi/driver.c#L425
[3] https://elixir.bootlin.com/linux/latest/source/drivers/firmware/arm_scmi/driver.c#L474