Re: [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock

From: Mikulas Patocka
Date: Tue Oct 20 2015 - 16:03:28 EST




On Sun, 18 Oct 2015, Ming Lei wrote:

> On Thu, Oct 15, 2015 at 4:47 AM, Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
> > From: Mikulas Patocka <mpatocka@xxxxxxxxxx>
> >
> > The block layer uses per-process bio list to avoid recursion in
> > generic_make_request. When generic_make_request is called recursively,
> > the bio is added to current->bio_list and generic_make_request returns
> > immediately. The top-level instance of generic_make_request takes bios
> > from current->bio_list and processes them.
> >
> > Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
> > stacking drivers") created a workqueue for every bio set and code
> > in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
> > redirecting bios queued on current->bio_list to the workqueue if the
> > system is low on memory. However another deadlock (see below **) may
> > happen, without any low memory condition, because generic_make_request
> > is queuing bios to current->bio_list (rather than submitting them).
> >
> > Fix this deadlock by redirecting any bios on current->bio_list to the
> > bio_set's rescue workqueue on every schedule call. Consequently, when
> > the process blocks on a mutex, the bios queued on current->bio_list are
> > dispatched to independent workqueus and they can complete without
> > waiting for the mutex to be available.
>
> It isn't common to acquire mutex/semaphone inside .make_request()
> or .request_fn(), so I am wondering it is good to reuse the rescuing
> workqueue for this unusual case.

Some drivers (dm-snapshot, dm-thin) do acquire a mutex in .make_requests()
for every bio. It wouldn't be practical to convert them to not acquire the
mutex (and it would also degrade performance of these drivers, if they had
to offload every bio to a worker thread that can acquire the mutex).

You can't acquire a mutex in .request_fn() because .request_fn() is called
with queue spinlock held.

> Also sometimes it can hurt performance by converting I/O submission
> from one context into concurrent contexts of workqueue, especially
> in case of sequential I/O, since plug & plug merge can't be used any
> more.

You can add blk_start_plug/blk_finish_plug to the function
bio_alloc_rescue. That would be reasonable to make sure that the requests
are merged even when they are offloaded to rescue thread.

> > - queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > + spin_lock(&bs->rescue_lock);
> > + bio_list_add(&bs->rescue_list, bio);
> > + queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > + spin_unlock(&bs->rescue_lock);
> > + }
>
> Not like rescuring path, schedule out can be quite frequent, and the
> above change will switch to submit these I/Os from wq concurrently,
> which might hurt performance for sequential I/O.
>
> Also I am wondering why not submit these I/Os in 'current' context
> just like what flush plug does?

Processing requests doesn't block (they only take the queue spinlock).

Processing bios can block (they can take other mutexes or semaphores), so
processing them from the schedule hook is impossible - the bio's
make_request function could attempt to take some lock that is already
held. So - we must offload the bios to a separate workqueue.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/