Re: [PATCH] hwmon: (k10temp) Report negative temperatures

From: Guenter Roeck
Date: Thu Jun 08 2023 - 15:18:47 EST


On 6/8/23 11:25, Kannan, Baski wrote:
[AMD Official Use Only - General]

To not spawn any new problems, we can go ahead with option 2. i.e., "do not apply it to processors which are known to _not_ be affected by the problem."


Sounds good to me.

Guenter

Thanks
- Baski

-----Original Message-----
From: Guenter Roeck <groeck7@xxxxxxxxx> On Behalf Of Guenter Roeck
Sent: Thursday, June 8, 2023 1:03 PM
To: Kannan, Baski <Baski.Kannan@xxxxxxx>
Cc: Moger, Babu <Babu.Moger@xxxxxxx>; clemens@xxxxxxxxxx; jdelvare@xxxxxxxx; linux-hwmon@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Ramayanam, Pavan <Pavan.Ramayanam@xxxxxxx>
Subject: Re: [PATCH] hwmon: (k10temp) Report negative temperatures

Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.


On 6/8/23 10:09, Kannan, Baski wrote:
The patch you have mentioned, aef17ca12719, sounds like a work-around for a problem found in some Ryzen Threadripper processors.
If I understand correctly, this work-around (aef17ca12719) has been provided as a blanket fix for all the processors.


Due to lack of better knowledge and understanding, yes. See https://github.com/lm-sensors/lm-sensors/issues/70. That doesn't mean that a blanket revert would be appropriate.

The Industrial Processor in question is the Epyc3k i3255.
AMD Family 17h (boot_cpu_data.x86)
AMD model 00h - 0fh (boot_cpu_data.x86_model)
Model Name - contains string "3255"

It supports temperatures ranging from -40 degrees Celsius to 105 degrees Celsius.
We have customers' machines running at -20 degrees Celsius. They require that the correct temperature be passed to their tools.


We have two options: Either limit the workaround to the list of processors which may be affected by the original problem, or do not apply it to processors which are known to _not_ be affected by the problem. Either can easily be implemented by adding a flag to struct k10temp_data and setting it in the probe function.
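
As a rough illustration of the flag approach, here is a user-space sketch, not the driver code. The names struct k10temp_model, allow_negative, model_probe() and model_read_temp() are hypothetical stand-ins, and the match criteria are simply the family, model range and name string quoted above. It models the second option: keep the clamp by default and skip it only for parts known not to be affected.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the driver's per-device state. */
struct k10temp_model {
	long temp_offset;	/* millidegrees Celsius, subtracted for Tdie */
	bool allow_negative;	/* hypothetical flag: skip the clamp to zero */
};

/*
 * Mimics what a probe function could do: set the flag once, based on the
 * CPU identification (family 17h, model 00h-0fh, name containing "3255").
 */
static void model_probe(struct k10temp_model *data, unsigned int family,
			unsigned int model, const char *model_name)
{
	data->allow_negative = family == 0x17 && model <= 0x0f &&
			       strstr(model_name, "3255") != NULL;
}

/*
 * Mimics the Tctl (channel 0) and Tdie (channel 1) read paths: the
 * existing clamp stays in place unless the flag says otherwise.
 */
static long model_read_temp(const struct k10temp_model *data, long raw,
			    int channel)
{
	long val = raw;

	if (channel == 1)
		val -= data->temp_offset;
	if (val < 0 && !data->allow_negative)
		val = 0;
	return val;
}
```

With this shape, a part that matches the criteria reports a raw reading below zero unchanged, while every other part keeps the current behavior. The first option would look the same with the sense of the flag inverted.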

No one outside AMD knows which processors may or may not be affected by the original problem. It was reported on 1950X at the time, but it may exist on all processors with the ability to set Sense MI Skew (and possibly Sense MI Offset), whatever that is. With that in mind, the fix will have to be provided by AMD.

Guenter

-----Original Message-----
From: Guenter Roeck <groeck7@xxxxxxxxx> On Behalf Of Guenter Roeck
Sent: Thursday, June 8, 2023 8:52 AM
To: Kannan, Baski <Baski.Kannan@xxxxxxx>
Cc: Moger, Babu <Babu.Moger@xxxxxxx>; clemens@xxxxxxxxxx;
jdelvare@xxxxxxxx; linux-hwmon@xxxxxxxxxxxxxxx;
linux-kernel@xxxxxxxxxxxxxxx
Subject: Re: [PATCH] hwmon: (k10temp) Report negative temperatures


On Tue, May 23, 2023 at 02:46:46PM -0700, Guenter Roeck wrote:
On Tue, May 23, 2023 at 03:49:32PM -0500, Baskaran Kannan wrote:
Currently, the tctl and die temperatures are rounded off to zero if
they are less than 0. There are industrial processors which work
below zero.

This was introduced with commit aef17ca12719 ("hwmon: (k10temp) Only
apply temperature offset if result is positive"). This patch would
effectively revert that change. Given the reason for introducing it I
am not convinced that it is a good idea to unconditionally revert it.


Any comments? I am not inclined to accept this patch as-is. What are the industrial processors? Is there a means to detect them?

Guenter

Guenter


To display the correct temperature, remove the rounding off.

Signed-off-by: Baskaran Kannan <Baski.Kannan@xxxxxxx>
---
drivers/hwmon/k10temp.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c
index 7b177b9fbb09..489ad0b1bc74 100644
--- a/drivers/hwmon/k10temp.c
+++ b/drivers/hwmon/k10temp.c
@@ -204,13 +204,9 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
 		switch (channel) {
 		case 0:		/* Tctl */
 			*val = get_raw_temp(data);
-			if (*val < 0)
-				*val = 0;
 			break;
 		case 1:		/* Tdie */
 			*val = get_raw_temp(data) - data->temp_offset;
-			if (*val < 0)
-				*val = 0;
 			break;
 		case 2 ... 13:		/* Tccd{1-12} */
 			amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
--
2.25.1