Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning

From: Logan Gunthorpe
Date: Thu Jul 18 2019 - 20:40:03 EST




On 2019-07-18 6:25 p.m., Sagi Grimberg wrote:
>
>> With multipath enabled, nvme_scan_work() can read from the
>> device (through nvme_mpath_add_disk()). However, with fabrics,
>> once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang
>> (see nvmf_check_ready()).
>>
>> After setting the state to deleting, nvme_remove_namespaces() will
>> hang waiting for scan_work to flush and these tasks will hang.
>>
>> To fix this, ensure we take scan_lock before changing the ctrl-state.
>> Also, ensure the state is checked while the lock is held
>> in nvme_scan_lock_work().
>
> That's a big hammer...

I didn't think the scan_lock was that contested or that
nvme_change_ctrl_state() was really called that often...

> But this is I/O that we cannot have queued until we have a path..
>
> I would rather have nvme_remove_namespaces() requeue all I/Os for
> namespaces that serve as the current_path and have the make_request
> routine to fail I/O if all controllers are deleting as well.
>
> Would something like [1] (untested) make sense instead?

I'll have to give this a try next week and I'll let you know then. It
kind of makes sense to me, but a number of things I tried that I thought
made sense did not fix this.

>
>> +    mutex_lock(&ctrl->scan_lock);
>> +
>>       if (ctrl->state != NVME_CTRL_LIVE)
>>           return;
>
> unlock

If we unlock here and relock below, we'd have to recheck the ctrl->state
to avoid any races. If you don't want to call nvme_identify_ctrl with
the lock held, then it would probably be better to move the state check
below it.
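
To illustrate what I mean by checking the state with the lock held, here
is a minimal userspace sketch using pthreads. It is not kernel code and
not the patch itself: fake_ctrl, fake_change_state() and fake_scan_work()
are hypothetical stand-ins, and it only models the locking rule, not the
fabrics hang. Because the state is only written and read under scan_lock,
the scan cannot see LIVE and then have the state flip to DELETING midway.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the controller states discussed above. */
enum fake_state { FAKE_CTRL_LIVE, FAKE_CTRL_DELETING };

struct fake_ctrl {
	pthread_mutex_t scan_lock;
	enum fake_state state;
	int scans_done;
};

static void fake_ctrl_init(struct fake_ctrl *c)
{
	int rc = pthread_mutex_init(&c->scan_lock, NULL);

	assert(rc == 0);
	(void)rc;
	c->state = FAKE_CTRL_LIVE;
	c->scans_done = 0;
}

/* State changes take scan_lock, so they wait for a running scan. */
static void fake_change_state(struct fake_ctrl *c, enum fake_state new_state)
{
	pthread_mutex_lock(&c->scan_lock);
	c->state = new_state;
	pthread_mutex_unlock(&c->scan_lock);
}

/*
 * The state check and the scan happen under the same lock hold.
 * Dropping the lock between the two would require a recheck.
 * Returns true if the scan actually ran.
 */
static bool fake_scan_work(struct fake_ctrl *c)
{
	bool ran = false;

	pthread_mutex_lock(&c->scan_lock);
	if (c->state == FAKE_CTRL_LIVE) {
		c->scans_done++;	/* stands in for the real scan I/O */
		ran = true;
	}
	pthread_mutex_unlock(&c->scan_lock);
	return ran;
}
```

The single unlock at the end is the point: every exit path goes through
it, and there is no window between the check and the work.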

>>   @@ -3547,7 +3554,6 @@ static void nvme_scan_work(struct work_struct *work)
>>       if (nvme_identify_ctrl(ctrl, &id))
>>           return;
>
> unlock
>
>
> [1]:
> --
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 76cd3dd8736a..627f5871858d 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3576,6 +3576,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>         struct nvme_ns *ns, *next;
>         LIST_HEAD(ns_list);
>
> +       mutex_lock(&ctrl->scan_lock);
> +       list_for_each_entry(ns, &ctrl->namespaces, list)
> +               nvme_mpath_clear_current_path(ns);
> +       mutex_unlock(&ctrl->scan_lock);
> +
>         /* prevent racing with ns scanning */
>         flush_work(&ctrl->scan_work);
>
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a9a927677970..da1731266788 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -231,6 +231,24 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>         return ns;
>  }
>
> +static bool nvme_available_path(struct nvme_ns_head *head)
> +{
> +       struct nvme_ns *ns;
> +
> +       list_for_each_entry_rcu(ns, &head->list, siblings) {
> +               switch (ns->ctrl->state) {
> +               case NVME_CTRL_LIVE:
> +               case NVME_CTRL_RESETTING:
> +               case NVME_CTRL_CONNECTING:
> +                       /* fallthru */
> +                       return true;
> +               default:
> +                       break;
> +               }
> +       }
> +       return false;
> +}
> +
>  static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
>                 struct bio *bio)
>  {
> @@ -257,14 +275,14 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
>                                       disk_devt(ns->head->disk),
>                                       bio->bi_iter.bi_sector);
>                 ret = direct_make_request(bio);
> -       } else if (!list_empty_careful(&head->list)) {
> -               dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
> +       } else if (nvme_available_path(head)) {
> +               dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
>
>                 spin_lock_irq(&head->requeue_lock);
>                 bio_list_add(&head->requeue_list, bio);
>                 spin_unlock_irq(&head->requeue_lock);
>         } else {
> -               dev_warn_ratelimited(dev, "no path - failing I/O\n");
> +               dev_warn_ratelimited(dev, "no available path - failing I/O\n");
>
>                 bio->bi_status = BLK_STS_IOERR;
>                 bio_endio(bio);
> --