Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning

From: Logan Gunthorpe
Date: Wed Jul 24 2019 - 15:12:13 EST


Hey,

Sorry for the delay.

I tested your patch and it does work. Do you want me to send your change
as a full patch? Can I add your signed-off-by?

On 2019-07-18 6:50 p.m., Sagi Grimberg wrote:
>> I didn't think the scan_lock was that contested or that
>> nvme_change_ctrl_state() was really called that often...
>
> it shouldn't be, but I think it makes the flow more convoluted
> as we serialize by flushing the scan_work right after...

I would argue that the check for state in nvme_scan_work() without a
lock is racy and confusing. There's nothing to prevent the state from
changing immediately after the check.

> The design principal is met as we do get the I/O failing,
> but its just that with mpath we simply queue the I/O again
> because the head->list happens to not be empty.
> Perhaps taking care of that check is cleaner.

Yes, I feel your patch is a good solution on it's own merits.
> Thanks. Do you have a firm reproducer for it?

Yes. If you connect to and then immediately disconnect from a target (at
least with nvme-loop) you will reliably trigger this bug -- or one of
the others I've sent patches for.

>>>> +ÂÂÂ mutex_lock(&ctrl->scan_lock);
>>>> +
>>>> ÂÂÂÂÂÂ if (ctrl->state != NVME_CTRL_LIVE)
>>>> ÂÂÂÂÂÂÂÂÂÂ return;
>>>
>>> unlock
>>
>> If we unlock here and relock below, we'd have to recheck the ctrl->state
>> to avoid any races. If you don't want to call nvme_identify_ctrl with
>> the lock held, then it would probably be better to move the state check
>> below it.
>
> Meant before the return statement.

Ah, right, my mistake.

Logan