Re: [PATCH v2] nvme: fix reconnection fail due to reserved tag allocation

From: Sagi Grimberg
Date: Thu Mar 07 2024 - 06:19:30 EST




On 07/03/2024 13:06, brookxu.cn wrote:
From: Chunguang Xu <chunguang.xu@xxxxxxxxxx>

We found an issue in a production environment while using NVMe
over RDMA: admin_q reconnect failed forever even though the
remote target and the network were fine. After digging into it,
we found it may be caused by an ABBA deadlock due to tag
allocation. In my case, the tag was held by a keep-alive request
waiting inside admin_q. As we quiesce admin_q while resetting
the ctrl, the request is marked as idle and will not be
processed before the reset succeeds. As fabric_q shares its
tagset with admin_q, reconnecting to the remote target needs a
tag for the connect command, but the only reserved tag was held
by the keep-alive command waiting inside admin_q. As a result,
we failed to reconnect admin_q forever. To fix this issue, I
think we should keep two reserved tags for the admin queue.
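
The sequence described above can be replayed with a small toy model.
This is only a sketch under stated assumptions: struct tag_pool,
pool_get(), pool_put() and reconnect_can_proceed() are hypothetical
names made up for illustration, not the blk-mq or nvme API, and the
real allocation path blocks where this model simply returns -1.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RESERVED 4

/* Toy model of the reserved tags shared by admin_q and fabric_q. */
struct tag_pool {
	int nr_reserved;           /* reserved tags available in this pool */
	bool in_use[MAX_RESERVED]; /* true while a request holds the tag */
};

static void pool_init(struct tag_pool *p, int nr_reserved)
{
	assert(nr_reserved >= 0 && nr_reserved <= MAX_RESERVED);
	p->nr_reserved = nr_reserved;
	for (int i = 0; i < MAX_RESERVED; i++)
		p->in_use[i] = false;
}

/* Non-blocking allocation: returns a tag index, or -1 if none is free. */
static int pool_get(struct tag_pool *p)
{
	for (int i = 0; i < p->nr_reserved; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			return i;
		}
	}
	return -1;
}

static void pool_put(struct tag_pool *p, int tag)
{
	assert(tag >= 0 && tag < p->nr_reserved && p->in_use[tag]);
	p->in_use[tag] = false;
}

/*
 * Replay the scenario from the commit message:
 *   1. a keep-alive request takes a reserved tag and then sits in the
 *      quiesced admin_q, so its tag is not released before the reset
 *      succeeds;
 *   2. the reconnect path needs a reserved tag for the connect command.
 * Returns true if the connect command can get a tag.
 */
bool reconnect_can_proceed(int nr_reserved)
{
	struct tag_pool pool;

	pool_init(&pool, nr_reserved);

	/* Step 1: keep-alive grabs a tag and is never completed. */
	(void)pool_get(&pool);

	/* Step 2: connect needs its own reserved tag. */
	int connect_tag = pool_get(&pool);
	if (connect_tag < 0)
		return false; /* keep-alive holds the only tag: stuck */

	pool_put(&pool, connect_tag);
	return true;
}
```

In this model, reconnect_can_proceed(1) fails because the stuck
keep-alive owns the only reserved tag, while reconnect_can_proceed(2)
succeeds because the connect command has a tag of its own.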

Fixes: ed01fee283a0 ("nvme-fabrics: only reserve a single tag")
Signed-off-by: Chunguang Xu <chunguang.xu@xxxxxxxxxx>

Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>