Re: [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock

From: Mike Snitzer
Date: Tue Oct 20 2015 - 15:57:15 EST


On Sat, Oct 17 2015 at 12:04pm -0400,
Ming Lei <tom.leiming@xxxxxxxxx> wrote:

> On Thu, Oct 15, 2015 at 4:47 AM, Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
> > From: Mikulas Patocka <mpatocka@xxxxxxxxxx>
> >
> > The block layer uses per-process bio list to avoid recursion in
> > generic_make_request. When generic_make_request is called recursively,
> > the bio is added to current->bio_list and generic_make_request returns
> > immediately. The top-level instance of generic_make_request takes bios
> > from current->bio_list and processes them.
> >
> > Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
> > stacking drivers") created a workqueue for every bio set and code
> > in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
> > redirecting bios queued on current->bio_list to the workqueue if the
> > system is low on memory. However another deadlock (see below **) may
> > happen, without any low memory condition, because generic_make_request
> > is queuing bios to current->bio_list (rather than submitting them).
> >
> > Fix this deadlock by redirecting any bios on current->bio_list to the
> > bio_set's rescue workqueue on every schedule call. Consequently, when
> > the process blocks on a mutex, the bios queued on current->bio_list are
> > dispatched to independent workqueus and they can complete without
> > waiting for the mutex to be available.
>
> It isn't common to acquire mutex/semaphone inside .make_request()
> or .request_fn(), so I am wondering it is good to reuse the rescuing
> workqueue for this unusual case.

Which specific locking are you concerned about?

> Also sometimes it can hurt performance by converting I/O submission
> from one context into concurrent contexts of workqueue, especially
> in case of sequential I/O, since plug & plug merge can't be used any
> more.

True, plug and plug merge wouldn't be usable but this recursive call to
generic_make_request isn't expected to be common.

This patch was to fix a relatively obscure bio spliting scenario that
fell out of the complexity of dm-snapshot.

> > diff --git a/block/bio.c b/block/bio.c
> > index ad3f276..99f5a2ad 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -354,35 +354,35 @@ static void bio_alloc_rescue(struct work_struct *work)
> > }
> > }
> >
> > -static void punt_bios_to_rescuer(struct bio_set *bs)
> > +/**
> > + * blk_flush_bio_list
> > + * @tsk: task_struct whose bio_list must be flushed
> > + *
> > + * Pop bios queued on @tsk->bio_list and submit each of them to
> > + * their rescue workqueue.
> > + *
> > + * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
> > + * However, stacking drivers should use bio_set, so this shouldn't be
> > + * an issue.
> > + */
> > +void blk_flush_bio_list(struct task_struct *tsk)
...
> > + while ((bio = bio_list_pop(&list))) {
> > + struct bio_set *bs = bio->bi_pool;
> > + if (unlikely(!bs)) {
> > + bio_list_add(tsk->bio_list, bio);
> > + continue;
> > + }
> >
> > - queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > + spin_lock(&bs->rescue_lock);
> > + bio_list_add(&bs->rescue_list, bio);
> > + queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > + spin_unlock(&bs->rescue_lock);
> > + }
>
> Not like rescuring path, schedule out can be quite frequent, and the
> above change will switch to submit these I/Os from wq concurrently,
> which might hurt performance for sequential I/O.
>
> Also I am wondering why not submit these I/Os in 'current' context
> just like what flush plug does?

Flush plug during schedule makes use of kblockd so I'm not sure what
you're referring to here.

> > }
> >
> > /**
> > @@ -422,7 +422,6 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
> > */
> > struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> > {
> > - gfp_t saved_gfp = gfp_mask;
> > unsigned front_pad;
> > unsigned inline_vecs;
> > unsigned long idx = BIO_POOL_NONE;
> > @@ -457,23 +456,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> > * reserve.
> > *
> > * We solve this, and guarantee forward progress, with a rescuer
> > - * workqueue per bio_set. If we go to allocate and there are
> > - * bios on current->bio_list, we first try the allocation
> > - * without __GFP_WAIT; if that fails, we punt those bios we
> > - * would be blocking to the rescuer workqueue before we retry
> > - * with the original gfp_flags.
> > + * workqueue per bio_set. If an allocation would block (due to
> > + * __GFP_WAIT) the scheduler will first punt all bios on
> > + * current->bio_list to the rescuer workqueue.
> > */
> > -
> > - if (current->bio_list && !bio_list_empty(current->bio_list))
> > - gfp_mask &= ~__GFP_WAIT;
> > -
> > p = mempool_alloc(bs->bio_pool, gfp_mask);
> > - if (!p && gfp_mask != saved_gfp) {
> > - punt_bios_to_rescuer(bs);
> > - gfp_mask = saved_gfp;
> > - p = mempool_alloc(bs->bio_pool, gfp_mask);
> > - }
> > -
> > front_pad = bs->front_pad;
> > inline_vecs = BIO_INLINE_VECS;
> > }
> > @@ -486,12 +473,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >
> > if (nr_iovecs > inline_vecs) {
> > bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> > - if (!bvl && gfp_mask != saved_gfp) {
> > - punt_bios_to_rescuer(bs);
> > - gfp_mask = saved_gfp;
> > - bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> > - }
> > -
>
> Looks you touched rescuing path for bio allocation, and better to just
> do one thing in one patch.

Yes, good point, I've split the patches, you can see the result in the 2
topmost commits in this branch:

http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

Definitely helped clean up the patches and make them more
approachable/reviewable. Thanks for the suggestion.

I'll hold off on sending out v4 of these patches until I can better
undersatnd your concerns about using the rescue workqueue during
schedule.

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/