Re: [RFC PATCH 0/4] nvme-tcp: fix hung issues for deleting

From: Sagi Grimberg
Date: Mon Jun 05 2023 - 19:09:19 EST



From: Chunguang Xu <chunguang.xu@xxxxxxxxxx>

We found that nvme_remove_namespaces() may hang in flush_work(&ctrl->scan_work)
while removing a ctrl. The root cause may be that the ctrl state changes to
NVME_CTRL_DELETING during removal, which interrupts nvme_tcp_error_recovery_work()/
nvme_reset_ctrl_work()/nvme_tcp_reconnect_or_remove(). At that point the ctrl is
frozen and its queues are quiesced. scan_work may still issue I/O to load the
partition table; that I/O gets blocked, which leads to nvme_tcp_error_recovery_work()
hanging in flush_work(&ctrl->scan_work).

After analysis, we found that there are two main cases:
1. Since the ctrl is frozen, scan_work hangs in __bio_queue_enter() when it issues
new I/O to load the partition table.
2. Since the queues are quiesced, requeued timed-out I/O may hang in the
hctx->dispatch queue, leaving scan_work waiting for I/O completion.
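
The two cases above can be illustrated with a small user-space model. This is
only a sketch of the reasoning, not kernel code: every name in it
(model_queue, model_submit_new_io(), model_requeue_io(),
model_flush_scan_work_returns()) is hypothetical and exists only for this
illustration. It assumes a frozen queue blocks new submitters on entry, a
quiesced queue parks already-entered requests on the dispatch list, and a
flush of scan_work can only return once the I/O issued by the scan completes.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum io_outcome {
    IO_COMPLETES,         /* request reaches the driver and can finish */
    IO_BLOCKS_ON_ENTER,   /* case 1: submitter sleeps, queue is frozen */
    IO_STUCK_IN_DISPATCH, /* case 2: request parked, queue is quiesced */
};

struct model_queue {
    bool frozen;   /* models the frozen ctrl from case 1 */
    bool quiesced; /* models the quiesced queues from case 2 */
};

/* New I/O must first enter the queue, then be dispatched. */
static enum io_outcome model_submit_new_io(const struct model_queue *q)
{
    if (q->frozen)
        return IO_BLOCKS_ON_ENTER;
    if (q->quiesced)
        return IO_STUCK_IN_DISPATCH;
    return IO_COMPLETES;
}

/* Requeued I/O has already entered, so only quiescing matters. */
static enum io_outcome model_requeue_io(const struct model_queue *q)
{
    return q->quiesced ? IO_STUCK_IN_DISPATCH : IO_COMPLETES;
}

/* Flushing scan_work returns only if the scan's I/O completes. */
static bool model_flush_scan_work_returns(enum io_outcome scan_io)
{
    return scan_io == IO_COMPLETES;
}
```

In this model, a queue with frozen set makes model_submit_new_io() report
IO_BLOCKS_ON_ENTER (case 1), and a queue with quiesced set makes
model_requeue_io() report IO_STUCK_IN_DISPATCH (case 2). In both cases
model_flush_scan_work_returns() is false, which corresponds to the hang in
flush_work(&ctrl->scan_work).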

Hey, can you please look at the discussion of Ming's proposal in
"nvme: add nvme_delete_dead_ctrl for avoiding io deadlock"?

Looks the same to me.