Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle

From: Kai-Heng Feng
Date: Fri May 10 2019 - 02:07:12 EST


at 06:19, <Mario.Limonciello@xxxxxxxx> <Mario.Limonciello@xxxxxxxx> wrote:

-----Original Message-----
From: Keith Busch <kbusch@xxxxxxxxxx>
Sent: Thursday, May 9, 2019 4:54 PM
To: Limonciello, Mario
Cc: kai.heng.feng@xxxxxxxxxxxxx; hch@xxxxxx; axboe@xxxxxx;
sagi@xxxxxxxxxxx; rafael@xxxxxxxxxx; linux-pm@xxxxxxxxxxxxxxx;
rafael.j.wysocki@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-
nvme@xxxxxxxxxxxxxxxxxxx; keith.busch@xxxxxxxxx
Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on
Suspend-to-Idle


[EXTERNAL EMAIL]

On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@xxxxxxxx wrote:
+int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
+{
+ int ret;
+
+ mutex_lock(&ctrl->scan_lock);
+ nvme_start_freeze(ctrl);
+ nvme_wait_freeze(ctrl);
+ ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
+ NULL);
+ nvme_unfreeze(ctrl);
+ mutex_unlock(&ctrl->scan_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(nvme_set_power);

I believe without memory barriers at the end disks with HMB this will
still kernel panic (Such as Toshiba BG3).

Well, the mutex has an implied memory barrier, but your HMB explanation
doesn't make much sense to me anyway. The "mb()" in this thread's original
patch is a CPU memory barrier, and the CPU had better not be accessing
HMB memory. Is there something else going on here?

Kai Heng will need to speak up a bit in his time zone as he has this disk on hand,
but what I recall from our discussion was that DMA operation MemRd after
resume was the source of the hang.

Yes, thatâ what I was told by the NVMe vendor, so all I know is to impose a memory barrier.
If mb() shouldnât be used here, whatâs the correct variant to use in this context?


This still allows D3 which we found at least failed to go into deepest state and
blocked
platform s0ix for the following SSDs (maybe others):
Hynix PC601
LiteOn CL1

We usually write features to spec first, then quirk non-compliant
devices after.

NVME spec doesn't talk about a relationship between SetFeatures w/
NVME_FEAT_POWER_MGMGT and D3 support, nor order of events.

This is why we opened a dialog with storage vendors, including contrasting the behavior
of Microsoft Windows inbox NVME driver and Intel's Windows RST driver.

Those two I mention that come to mind immediately because they were most recently
tested to fail. Our discussion with storage vendors overwhelmingly requested
that we don't use D3 under S2I because their current firmware architecture won't
support it.

For example one vendor told us with current implementation that receiving D3hot
after NVME shutdown will prevent being able to enter L1.2. D3hot entry was supported
by an IRQ handler that isn't serviced in NVME shutdown state.

Another vendor told us that with current implementation it's impossible to transition
to PS4 (at least via APST) while L1.2 D3hot is active.

I tested the patch from Keith and it has two issues just as simply skipping nvme_dev_disable():
1) It consumes more power in S2I
2) System freeze after resume

Also I donât think itâs a spec. Itâs a guide to let OEM/ODM pick and assemble Modern Standby compliant machines, so what Windows NVMe driver really does is still opaque to us.

Kai-Heng