Re: [PATCH] thermal: core: Add a back up thermal shutdown mechanism

From: Zhang Rui
Date: Wed Apr 12 2017 - 04:45:49 EST


On Wed, 2017-04-12 at 14:06 +0530, Keerthy wrote:
>
> On Wednesday 12 April 2017 01:56 PM, Zhang Rui wrote:
> >
> > On Wed, 2017-04-12 at 13:25 +0530, Keerthy wrote:
> > >
> > >
> > > On Wednesday 12 April 2017 09:35 AM, Eduardo Valentin wrote:
> > > >
> > > >
> > > > Keerthy,
> > > >
> > > > On Wed, Apr 12, 2017 at 09:09:36AM +0530, Keerthy wrote:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wednesday 12 April 2017 08:50 AM, Zhang Rui wrote:
> > > > > >
> > > > > >
> > > > > > On Wed, 2017-04-12 at 08:19 +0530, Keerthy wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin
> > > > > > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Hey,
> > > > > > > >
> > > > > > > > On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > off).
> > > > <cut>
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > OK... This seams to me, still a corner case supposed to
> > > > > > > > be
> > > > > > > > fixed at
> > > > > > > > orderly_power_off, not at thermal. But..
> > > > > > > >
> > > > ^^^ Then again, this must be fixed not at thermal core. And re-
> > > > reading
> > > > the history of this thread, this seams to be really something
> > > > broken at
> > > > OMAP/DRA7, as mentioned in previous messages. That is probably
> > > > why
> > > > you
> > > > are pushing for pm_power_off(), which seams to be the one that
> > > > works for
> > > > you, pulling the plug correctly (DRA7).
> > > Zhang/Eduardo,
> > >
> > > OMAP5/DRA7 is one case.
> > >
> > > I believe i this is the root cause of this failure.
> > >
> > > thermal_zone_device_check --> thermal_zone_device_update -->
> > > handle_thermal_trip --> handle_critical_trips -->
> > > orderly_poweroff
> > >
> > > The above sequence happens every 250/500 mS based on the
> > > configuration.
> > > The orderly_poweroff function is getting called every 250/500 mS
> > > and
> > > i
> > > see with a full fledged nfs file system it takes at least 5-10
> > > Seconds
> > > to shutdown and during that time we bombard with orderly_poweroff
> > > calls
> > > multiple times due to the thermal_zone_device_check triggering
> > > periodically.
> > >
> > > To confirm that i made sure that handle_critical_trips calls
> > > orderly_poweroff only once and i no longer see the failure on
> > > DRA72-
> > > EVM
> > > board.
> > >
> > Nice catch!
> Thanks.
>
> >
> >
> > >
> > > So IMHO once we get to handle_critical_trips case where we do
> > > orderly_poweroff we need to do the following:
> > >
> > > 1) Make sure that orderly_poweroff is called only once.
> > agreed.
> >
> > >
> > > 2) Cancel all the scheduled work queues to monitor the
> > > temperature as
> > > we have already reached a point of shutting down the system.
> > >
> > agreed.
> >
> > now I think we've found the root cause of the problem.
> > orderly_poweroff() is not reenterable and it does not have to be.
> > If we're using orderly_poweroff() for emergency power off, we have
> > to
> > use it correctly.
> >
> > will you generate a patch to do this?
> Sure. I will generate a patch to take care of 1) To make sure that
> orderly_poweroff is called only once right away. I have already
> tested.
>
> for 2) Cancel all the scheduled work queues to monitor the
> temperature.
> I will take some more time to make it and test.
>
> Is that okay? Or you want me to send both together?
>
I think you can send patch for step 1 first.

thanks,
rui
> Regards,
> Keerthy
>
> >
> >
> > thanks,
> > rui
> >
> > >
> > > Let me know your thoughts on this.
> > >
> > > Best Regards,
> > > Keerthy
> > > >
> > > >
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > However, there is no clean way of detecting such
> > > > > > > > > failure
> > > > > > > > > of
> > > > > > > > > userspace
> > > > > > > > > powering off the system. In such scenarios, it is
> > > > > > > > > necessary for a
> > > > > > > > > backup
> > > > > > > > > workqueue to be able to force a shutdown of the
> > > > > > > > > system
> > > > > > > > > when
> > > > > > > > > orderly
> > > > > > > > > shutdown is not successful after a configurable time
> > > > > > > > > period.
> > > > > > > > >
> > > > > > > > Given that system running hot is a thermal issue, I
> > > > > > > > guess
> > > > > > > > we care
> > > > > > > > more
> > > > > > > > on this matter then..
> > > > > > > Yes!
> > > > > > >
> > > > > > I just read this thread again https://patchwork.kernel.org/
> > > > > > patc
> > > > > > h/802458
> > > > > > 1/ to recall the previous discussion.
> > > > > >
> > > > > > https://patchwork.kernel.org/patch/8149891/
> > > > > > https://patchwork.kernel.org/patch/8149861/
> > > > > > should be the solution made based on Ingo' suggestion,
> > > > > > right?
> > > > > >
> > > > > > And to me, this sounds like the right direction to go,
> > > > > > thermal
> > > > > > does not
> > > > > > need a back up shutdown solution, it just needs a kernel
> > > > > > function call
> > > > > > which guarantees the system can be shutdown/reboot
> > > > > > immediately.
> > > > > >
> > > > > > is there any reason that patch 1/2 is not accepted?
> > > > > Zhang,
> > > > >
> > > > > http://www.serverphorums.com/read.php?12,1400964
> > > > >
> > > > > I got a NAK from Alan and was given this direction on a
> > > > > thermal_poweroff
> > > > > which is more or less what is done in this patch.
> > > > >
> > > > Actually, Alan's suggestion is more for you to define a
> > > > thermal_poweroff() that can be defined per architecture.
> > > >
> > > > Also, please, keep track of your patch versions and also do
> > > > copy
> > > > everybody who has stated their opinion on previous discussions.
> > > > These
> > > > patches must have Ingo, Alan, and RMK copied too. In this way
> > > > we
> > > > avoid
> > > > loosing track of what has been suggested and we also converge
> > > > faster to
> > > > something everybody (or most of us) agree. Next version,
> > > > please,
> > > > fix
> > > > that.
> > > >
> > > >
> > > > To me, thermal core needs a function that simply powers off the
> > > > system.
> > > > No timeouts, delayed works, backups, etc. Simple and straight.
> > > >
> > > > The idea of having a per architecture implementation, as per
> > > > Alan's
> > > > suggestion, makes sense to me too. Having something different
> > > > from
> > > > pm_power_off(), specific to thermal, might also give the
> > > > opportunity to
> > > > save the power off reason.
> > > >
> > > > BR,
> > > >
> > > > Eduardo Valentin
> > > >