Re: [PATCH v3] hwmon: (k10temp) Enable AMD3255 Proc to show negative temperature

From: Limonciello, Mario
Date: Mon Jun 19 2023 - 14:53:19 EST



On 6/19/2023 1:40 PM, Guenter Roeck wrote:
On 6/19/23 11:02, Limonciello, Mario wrote:

On 6/19/2023 12:07 PM, Guenter Roeck wrote:
On 6/19/23 09:54, Baskaran Kannan wrote:
Industrial processor i3255 supports temperatures -40 deg celcius
to 105 deg Celcius. The current implementation of k10temp_read_temp
rounds off any negative
temperatures to '0'. To fix this, the following changes have been made.
Added a flag 'disp_negative' to struct k10temp_data to support
AMD i3255 processors. Flag 'disp_negative' is set if 3255 processor
is found during k10temp_probe.  Flag 'disp_negative' is used to determine
whether to round off negative temperatures to '0' in k10temp_read_temp.

Signed-off-by: Baskaran Kannan <Baski.Kannan@xxxxxxx>

Now you have made changes you were not asked to make, extended the flag
to cover a range of processors instead of just i3255, and did not provide
a change log nor a comment in the code describing why processors with
certain model numbers should display negative temperatures.

i3255 happens to be one of the industrial processors in family 17h models
01h through 08h.  These are potentially used at subzero temperatures and
so displaying negative numbers makes a lot sense.

So I think the commit message needs to be be amended to better explain that.

I guided Kannan against leaving a comment in the code with specific models
because it either won't age well as other industrial processors are
introduced or may need to be ping-ponged each time.

That only applies if there is a guarantee that the check does not
inadvertently ends up displaying negative temperatures for other CPUs
which are misconfigured. After all, the current code is just a hack
working around some problem with bad temperatures reported on other CPUs.
Personally I'd rather have a clean fix for that. If/since that is not
available, whatever is done subsequently (including the code suggested here)
is just a hack.

... and if a hack on top of a hack is introduced, we need to make sure that
it does not undo the previous hack.

But perhaps it should be more generic like:

/* Industrial processors may be used at sub zero temperatures */


You can not just display negative temperatures for family 0x17h models
0x00..0x07 without explanation. The above needs to be documented.
I fail to understand why a variant of

"i3255 happens to be one of the industrial processors in family 17h models
 01h through 08h.  These are potentially used at subzero temperatures and
 so displaying negative numbers makes a lot sense."

can not be added as comment and description if that is exactly what the code
checks for. Something like

"Family 17h models 01h through 08h are industrial processors with an operational
 temperature of -40°C - 105°C and may be used at subzero temperatures.
 Display negative temperatures for those processors."

makes perfect sense to me. Only of course it is incorrect ...

Model 0x1 was used for the original Zen, and 0x8 is Zen+. 1950X is family 0x17 model
0x01 per cpuinfo, meaning your hack undoes the original hack, and the bad
temperatures would again be displayed for the affected systems. That is simply
unacceptable.
I just double checked the documentation and you're correct.
To target just the industrial ones it would need to be > 0x00 and <= 0x08 ALONG with
checking the model_id_str value against the industrial ones.

Yes, it may be a pain to find an acceptable hack to solve the problem,
but after all this is a self-inflicted problem, so it can't be helped.
The alternative would always be to find a better means to identify CPUs
affected by the original problem. If that is not possible, explicitly listing CPUs
which are _not_ affected is the only possible alternative.
So Pinnacle Ridge and Summit Ridge (Zen/Zen+) have model_id_str values of 'B1' and 'B2'.
I think we should be able to detect those and only avoid showing the negative values when:

* Family 17h
* Model > 0x00
* Model <= 0x0f
* Model ID str B1 or B2


Note that the code sets disp_negative for model numbers < 0x8, meaning it
does not include model 0x8. It also sets disp_negative for model 0x00 which is
specifically excluded above.

All that is no excuse for not providing change logs.
Appreciate your being thorough.

Guenter

Guenter

---
  drivers/hwmon/k10temp.c | 8 ++++++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c
index 7b177b9fbb09..2613420d43ff 100644
--- a/drivers/hwmon/k10temp.c
+++ b/drivers/hwmon/k10temp.c
@@ -86,6 +86,7 @@ struct k10temp_data {
      u32 show_temp;
      bool is_zen;
      u32 ccd_offset;
+    bool disp_negative;
  };
    #define TCTL_BIT    0
@@ -204,12 +205,12 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
          switch (channel) {
          case 0:        /* Tctl */
              *val = get_raw_temp(data);
-            if (*val < 0)
+            if (*val < 0 && !data->disp_negative)
                  *val = 0;
              break;
          case 1:        /* Tdie */
              *val = get_raw_temp(data) - data->temp_offset;
-            if (*val < 0)
+            if (*val < 0 && !data->disp_negative)
                  *val = 0;
              break;
          case 2 ... 13:        /* Tccd{1-12} */
@@ -405,6 +406,9 @@ static int k10temp_probe(struct pci_dev *pdev, const struct pci_device_id *id)
      data->pdev = pdev;
      data->show_temp |= BIT(TCTL_BIT);    /* Always show Tctl */
  +    if (boot_cpu_data.x86 == 0x17 && boot_cpu_data.x86_model < 0x8)
+        data->disp_negative = true;
+
      if (boot_cpu_data.x86 == 0x15 &&
          ((boot_cpu_data.x86_model & 0xf0) == 0x60 ||
           (boot_cpu_data.x86_model & 0xf0) == 0x70)) {