Re: 3.13.?: Strange / dangerous fan policy...

From: Rafael J. Wysocki
Date: Sun Mar 09 2014 - 13:43:14 EST


On Sunday, March 09, 2014 01:10:25 AM Manuel Krause wrote:
> On 2014-03-08 16:59, Guenter Roeck wrote:
> > On 03/08/2014 03:08 AM, Jean Delvare wrote:
> >> On Fri, 7 Mar 2014 14:52:30 -0800, Guenter Roeck wrote:
> >>> On Fri, Mar 07, 2014 at 11:04:29PM +0100, Manuel Krause wrote:
> >>>> Hi, and thanks for the quick response!
> >>>> No special fancy "fan control policy". 'fancontrol' isn't up or
> >>>> running.
> >>>> Vanilla kernels 3.11.* and 3.12.* had been working on here
> >>>> without
> >>>> any extra work.
> >>>> --
> >>>> # sensors
> >>>> acpitz-virtual-0
> >>>> Adapter: Virtual device
> >>>> temp1: +71.0°C (crit = +256.0°C)
> >>>> temp2: +69.0°C (crit = +110.0°C)
> >>>> temp3: +52.0°C (crit = +105.0°C)
> >>>> temp4: +25.0°C (crit = +110.0°C)
> >>>> temp5: +58.0°C (crit = +110.0°C)
> >>>>
> >>>> coretemp-isa-0000
> >>>> Adapter: ISA adapter
> >>>> Core 0: +62.0°C (high = +105.0°C, crit = +105.0°C)
> >>>> Core 1: +60.0°C (high = +105.0°C, crit = +105.0°C)
> >>>> --
> >>>> My notebook (HP/Compaq 6730b) does not have a separate fan
> >>>> sensor.
> >>>> This is with 3.12.13 with my normal workload.
> >>>>
> >>>> Please trust my above-mentioned values of 94°C vs. 74°C, as I
> >>>> don't want to boot 3.13.6 anymore, to avoid harm to the
> >>>> notebook's
> >>>> casing.
> >>>
> >>> Understood. Unfortunately, we'll need to get information
> >>> from the new kernel to be able to track down the problem.
> >>
> >> Indeed. Not only the run-time temperatures, but also the high
> >> and crit
> >> limits.
> >>
> >>>> But I'd be glad to test any patch that improves this.
> >>>
> >>> So far I have no idea what is going on. I don't see anything
> >>> in the
> >>> drivers providing above data that would explain the behavior,
> >>> but I might be missing something.
> >>
> >> Looks like a regression in the acpi subsystem or in power
> >> management,
> >> not hwmon. Hwmon is merely reporting the temperatures, it's not
> >> responsible for the actual temperatures.
> >>
> >
> > I would agree. I don't think we have enough information to be sure,
> > though. There might be some unintended interaction or interference.
> >
> > gpu is a good hint ... for example, look at commit b9ed919f1c8
> > (drm/nouveau/drm/pm: remove everything except the hwmon interfaces
> > to THERM). nouveau does export pwm and fan control information,
> > so any change in that code may have unintended side effects.
> > Similarly, I don't know how ec39f64bba (drm/radeon/dpm: Convert to
> > use devm_hwmon_register_with_groups) could have the observed impact,
> > as it is purely passive, but I'd rather be safe than sorry.
> >
> > This problem has now been submitted into bugzilla as
> > https://bugzilla.kernel.org/show_bug.cgi?id=71711.
> >
> > Guenter
> >
>
> Sorry for being late, I had to search for and gather a lot of info
> for you...
> I hope you don't mind that I put it all into one answer, CCing you all.
>
> My GFX is a GM45 Intel (mobile), shared memory, running the
> opensource Mesa drivers/extensions.
> kernel-module: i915
>
> According to the output of 'cpupower': I have
> CPUidle driver: acpi_idle
> CPUidle governor: menu
>
> CPUfreq:
> driver: acpi-cpufreq
> available cpufreq governors: ondemand, performance
> -
> And "ondemand" is running.
> --
>
> # sensors
> acpitz-virtual-0
> Adapter: Virtual device
> temp1: +41.0°C (crit = +256.0°C)
> temp2: +92.0°C (crit = +110.0°C)
> temp3: +71.0°C (crit = +105.0°C)
> temp4: +26.5°C (crit = +110.0°C)
> temp5: +25.0°C (crit = +110.0°C)
>
> coretemp-isa-0000
> Adapter: ISA adapter
> Core 0: +86.0°C (high = +105.0°C, crit = +105.0°C)
> Core 1: +84.0°C (high = +105.0°C, crit = +105.0°C)
>
> This is from a critical "smelly" situation today: kernel compilation,
> fan at 100%.
> --
>
> Additional findings:
>
> Identification from bootup ACPI initialisation vs. sensors:
> temp1 = DTSZ
> temp2 = CPUZ --> triggering cooling in 3.12.13 if > 74°C
> temp3 = SKNZ
> temp4 = BATZ "Battery Zone" always calm, ~ +6°C above ambient T
> temp5 = FDTZ --- in 3.12.13 a representation of the cooling fan
> (25 - 45 - 58 - max?)
> Core 0 & Core 1 are the internal CPU T sensors.
>
> With the 3.13.x (.5+) kernels, the cooling settings first gathered
> at bootup stay forever. That means rebooting a hot system will get
> FDTZ at 45°C+ and won't cause any problems, as it cools enough
> (even for kernel compiling here). If it gets 25°C at bootup, the
> system goes into emergency cooling at some point.
> The same happens with suspend/resume.
>
> Kernel 3.12.13 adjusts the cooling on its own, and appropriately.

This almost certainly is an ACPI regression, but I'm not sure whether
thermal management or CPU power management is broken on your system.

Can you compare the contents of /sys/class/thermal/ from working and
not working kernels, please?
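
One possible way to collect that data is sketched below. It is only an
illustration, not an established tool: the function names and the script
name used afterwards are made up, and it assumes that the entries under
/sys/class/thermal/ (thermal_zone*, cooling_device*) are directories whose
plain-file attributes (type, temp, trip points, cur_state and so on) can be
read as text. Symlinks to other directories are skipped on purpose.

```python
import os
import sys


def snapshot(root="/sys/class/thermal"):
    """Return {"entry/attribute": value} for plain files one level below root."""
    result = {}
    for entry in sorted(os.listdir(root)):
        entry_path = os.path.join(root, entry)
        # thermal_zone* and cooling_device* are (symlinks to) directories.
        if not os.path.isdir(entry_path):
            continue
        for name in sorted(os.listdir(entry_path)):
            path = os.path.join(entry_path, name)
            # Skip subdirectories and symlinks to directories (subsystem, cdev*).
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "r", errors="replace") as fh:
                    value = fh.read().strip()
            except OSError as exc:
                # Some attributes are write-only or fail on read; record that.
                value = "<unreadable: %s>" % exc.strerror
            result[entry + "/" + name] = value
    return result


def format_snapshot(snap):
    """Render a snapshot as sorted 'path: value' lines, suitable for diff."""
    return "\n".join("%s: %s" % (key, snap[key]) for key in sorted(snap))


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "/sys/class/thermal"
    print(format_snapshot(snapshot(root)))
```

Saved under a name such as thermal_snapshot.py (hypothetical), it could be
run once on each kernel with the output redirected to a file, and the two
files compared with diff -u.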

Rafael
