Re: [RFC PATCH] nvme: prevent hang on surprise removal of NVMe disk

From: Hannes Reinecke
Date: Wed Feb 16 2022 - 06:32:43 EST


On 2/16/22 12:18, Markus Blöchl wrote:
> On Tue, Feb 15, 2022 at 08:17:31PM +0100, Christoph Hellwig wrote:
>> On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
>>> After the surprise removal of a mounted NVMe disk the pciehp task
>>> reliably hangs forever with a trace similar to this one:
>>
>> Do you have a specific reproducer? At least with doing a
>>
>> echo 1 > /sys/.../remove
>>
>> while running fsx on a file system I can't actually reproduce it.
>
> We built our own enclosures with a custom connector to plug the disks.
>
> So an external enclosure for thunderbolt is probably very similar.
> (or just ripping an unscrewed NVMe out of the M.2 ...)
>
> But as already suggested, qemu might also be very useful here as it also
> allows us to test multiple namespaces and multipath I/O, if you/someone
> wants to check those too (hotplug with multipath I/O really scares me).

Nothing to be scared of.
I've tested this extensively in the run-up to commit 5396fdac56d8
("nvme: fix refcounting imbalance when all paths are down"), which,
incidentally, is something you need if you want to test things.

Let me see if I can dig up the testbed.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer