RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle

From: Mario.Limonciello
Date: Thu May 09 2019 - 18:21:16 EST


> -----Original Message-----
> From: Keith Busch <kbusch@xxxxxxxxxx>
> Sent: Thursday, May 9, 2019 4:54 PM
> To: Limonciello, Mario
> Cc: kai.heng.feng@xxxxxxxxxxxxx; hch@xxxxxx; axboe@xxxxxx;
> sagi@xxxxxxxxxxx; rafael@xxxxxxxxxx; linux-pm@xxxxxxxxxxxxxxx;
> rafael.j.wysocki@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-
> nvme@xxxxxxxxxxxxxxxxxxx; keith.busch@xxxxxxxxx
> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on
> Suspend-to-Idle
>
>
> [EXTERNAL EMAIL]
>
> On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@xxxxxxxx wrote:
> > > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
> > > +{
> > > + int ret;
> > > +
> > > + mutex_lock(&ctrl->scan_lock);
> > > + nvme_start_freeze(ctrl);
> > > + nvme_wait_freeze(ctrl);
> > > + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
> > > + NULL);
> > > + nvme_unfreeze(ctrl);
> > > + mutex_unlock(&ctrl->scan_lock);
> > > +
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(nvme_set_power);
> >
> > I believe without memory barriers at the end disks with HMB this will
> > still kernel panic (Such as Toshiba BG3).
>
> Well, the mutex has an implied memory barrier, but your HMB explanation
> doesn't make much sense to me anyway. The "mb()" in this thread's original
> patch is a CPU memory barrier, and the CPU had better not be accessing
> HMB memory. Is there something else going on here?

Kai Heng will need to speak up a bit in his time zone as he has this disk on hand,
but what I recall from our discussion was that DMA operation MemRd after
resume was the source of the hang.

>
> > This still allows D3 which we found at least failed to go into deepest state and
> blocked
> > platform s0ix for the following SSDs (maybe others):
> > Hynix PC601
> > LiteOn CL1
>
> We usually write features to spec first, then quirk non-compliant
> devices after.

NVME spec doesn't talk about a relationship between SetFeatures w/
NVME_FEAT_POWER_MGMGT and D3 support, nor order of events.

This is why we opened a dialog with storage vendors, including contrasting the behavior
of Microsoft Windows inbox NVME driver and Intel's Windows RST driver.

Those two I mention that come to mind immediately because they were most recently
tested to fail. Our discussion with storage vendors overwhelmingly requested
that we don't use D3 under S2I because their current firmware architecture won't
support it.

For example one vendor told us with current implementation that receiving D3hot
after NVME shutdown will prevent being able to enter L1.2. D3hot entry was supported
by an IRQ handler that isn't serviced in NVME shutdown state.

Another vendor told us that with current implementation it's impossible to transition
to PS4 (at least via APST) while L1.2 D3hot is active.