Re: [PATCH v2] nvme: Add hardware monitoring support

From: Akinobu Mita
Date: Wed Oct 30 2019 - 07:17:30 EST


2019å10æ30æ(æ) 7:32 Guenter Roeck <linux@xxxxxxxxxxxx>:
>
> nvme devices report temperature information in the controller information
> (for limits) and in the smart log. Currently, the only means to retrieve
> this information is the nvme command line interface, which requires
> super-user privileges.
>
> At the same time, it would be desirable to use NVME temperature information
> for thermal control.
>
> This patch adds support to read NVME temperatures from the kernel using the
> hwmon API and adds temperature zones for NVME drives. The thermal subsystem
> can use this information to set thermal policies, and userspace can access
> it using libsensors and/or the "sensors" command.
>
> Example output from the "sensors" command:
>
> nvme0-pci-0100
> Adapter: PCI adapter
> Composite: +39.0ÂC (high = +85.0ÂC, crit = +85.0ÂC)
> Sensor 1: +39.0ÂC
> Sensor 2: +41.0ÂC
>
> Signed-off-by: Guenter Roeck <linux@xxxxxxxxxxxx>
> ---
> v2: Use devm_kfree() to release memory in error path
>
> drivers/nvme/host/Kconfig | 10 ++
> drivers/nvme/host/Makefile | 1 +
> drivers/nvme/host/core.c | 5 +
> drivers/nvme/host/nvme-hwmon.c | 163 +++++++++++++++++++++++++++++++++
> drivers/nvme/host/nvme.h | 8 ++
> 5 files changed, 187 insertions(+)
> create mode 100644 drivers/nvme/host/nvme-hwmon.c
>
> diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
> index 2b36f052bfb9..aeb49e16e386 100644
> --- a/drivers/nvme/host/Kconfig
> +++ b/drivers/nvme/host/Kconfig
> @@ -23,6 +23,16 @@ config NVME_MULTIPATH
> /dev/nvmeXnY device will show up for each NVMe namespaces,
> even if it is accessible through multiple controllers.
>
> +config NVME_HWMON
> + bool "NVME hardware monitoring"
> + depends on (NVME_CORE=y && HWMON=y) || (NVME_CORE=m && HWMON)
> + help
> + This provides support for NVME hardware monitoring. If enabled,
> + a hardware monitoring device will be created for each NVME drive
> + in the system.
> +
> + If unsure, say N.
> +
> config NVME_FABRICS
> tristate
>
> diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
> index 8a4b671c5f0c..03de4797a877 100644
> --- a/drivers/nvme/host/Makefile
> +++ b/drivers/nvme/host/Makefile
> @@ -14,6 +14,7 @@ nvme-core-$(CONFIG_TRACING) += trace.o
> nvme-core-$(CONFIG_NVME_MULTIPATH) += multipath.o
> nvme-core-$(CONFIG_NVM) += lightnvm.o
> nvme-core-$(CONFIG_FAULT_INJECTION_DEBUG_FS) += fault_inject.o
> +nvme-core-$(CONFIG_NVME_HWMON) += nvme-hwmon.o
>
> nvme-y += pci.o
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fa7ba09dca77..fc1d4b146717 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2796,6 +2796,9 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
> ctrl->oncs = le16_to_cpu(id->oncs);
> ctrl->mtfa = le16_to_cpu(id->mtfa);
> ctrl->oaes = le32_to_cpu(id->oaes);
> + ctrl->wctemp = le16_to_cpu(id->wctemp);
> + ctrl->cctemp = le16_to_cpu(id->cctemp);
> +
> atomic_set(&ctrl->abort_limit, id->acl + 1);
> ctrl->vwc = id->vwc;
> if (id->mdts)
> @@ -2897,6 +2900,8 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
>
> ctrl->identified = true;
>
> + nvme_hwmon_init(ctrl);
> +
> return 0;
>
> out_free:

The nvme_init_identify() can be called multiple time in nvme ctrl's
lifetime (e.g 'nvme reset /dev/nvme*' or suspend/resume paths), so
should we need to prevent nvme_hwmon_init() from registering hwmon
device more than twice?

In the nvme thermal zone patchset[1], thernal zone is registered in
nvme_init_identify and unregistered in nvme_stop_ctrl().

[1] https://lore.kernel.org/linux-devicetree/1561990354-4084-2-git-send-email-akinobu.mita@xxxxxxxxx/

> diff --git a/drivers/nvme/host/nvme-hwmon.c b/drivers/nvme/host/nvme-hwmon.c
> new file mode 100644
> index 000000000000..af5eda326ec6
> --- /dev/null
> +++ b/drivers/nvme/host/nvme-hwmon.c
> @@ -0,0 +1,163 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * NVM Express hardware monitoring support
> + * Copyright (c) 2019, Guenter Roeck
> + */
> +
> +#include <linux/hwmon.h>
> +
> +#include "nvme.h"
> +
> +struct nvme_hwmon_data {
> + struct nvme_ctrl *ctrl;
> + struct nvme_smart_log log;
> +};
> +
> +static int nvme_hwmon_get_smart_log(struct nvme_hwmon_data *data)
> +{
> + return nvme_get_log(data->ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> + &data->log, sizeof(data->log), 0);
> +}

The 'data->log' is allocated per nvme_ctrl, so are there any locks to
prevent multiple callers of nvme_hwmon_get_smart_log() from breaking
the log buffer?

> +
> +static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
> + u32 attr, int channel, long *val)
> +{
> + struct nvme_hwmon_data *data = dev_get_drvdata(dev);
> + struct nvme_smart_log *log = &data->log;
> + int err;
> + int temp;
> +
> + err = nvme_hwmon_get_smart_log(data);
> + if (err)
> + return err < 0 ? err : -EPROTO;
> +
> + switch (attr) {
> + case hwmon_temp_max:
> + *val = (data->ctrl->wctemp - 273) * 1000;
> + break;
> + case hwmon_temp_crit:
> + *val = (data->ctrl->cctemp - 273) * 1000;
> + break;

When this function is called with 'hwmon_temp_max' or 'hwmon_temp_crit',
we don't need to call nvme_hwmon_get_smart_log() at all, do we?

> + case hwmon_temp_input:
> + if (!channel)
> + temp = le16_to_cpup((__le16 *)log->temperature);
> + else
> + temp = le16_to_cpu(log->temp_sensor[channel - 1]);
> + *val = (temp - 273) * 1000;
> + break;
> + case hwmon_temp_crit_alarm:
> + *val = !!(log->critical_warning & NVME_SMART_CRIT_TEMPERATURE);
> + break;
> + default:
> + err = -EOPNOTSUPP;
> + break;
> + }
> + return err;
> +}