Re: [PATCH] blk-mq: rationalize plug

From: Jeff Moyer
Date: Mon Apr 20 2015 - 13:34:33 EST


Shaohua Li <shli@xxxxxx> writes:

> Previous post of the patch is lost, so I repost. This is helpful, for
> example, using scsi-mq for a sata drive.
>
> plug is still helpful for workload with IO merge, but it can be harmful
> otherwise especially with multiple hardware queues, as there is (supposed) no
> lock contention in this case and plug can introduce latency.
>
> For single queue, we always do plug. Reducing lock contention is still a win.
> For multiple queues, we do a limited plug, eg plug only if there is merge. If a
> request doesn't have merge with following request, the requet will be
> dispatched immediately.
>
> An example workload here is fsync write a block device. Without plug
> merge, sequential write (fsync makes it sync IO) will dispatch 4k IO.

I like the general approach. I do wonder whether we should hold back a
single I/O at all times, or whether we should do something similar to
what Mike Snitzer did in the dm-mpath driver, where he keeps track of
the end position of the last I/O (without holding the I/O back) to
determine whether we're doing sequential I/O. With your approach, we
will be able to get merges from the very start of a sequential I/O
stream. With Mike's approach, you don't get merges until the second and
third I/O. Mike's way you get potentially better latency for random I/O
at the cost of some extra fields in a data structure. Something to
think about.

I have specfic comments inlined below. I think this patch would be
better layered on top of my fix for single queue merging, but the two
are basically independent, so I could go either way on that.

> @@ -1230,8 +1269,12 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
> {
> const int is_sync = rw_is_sync(bio->bi_rw);
> const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA);
> + unsigned int use_plug, request_count = 0;
> struct blk_map_ctx data;
> struct request *rq;
> + struct blk_plug *plug;
> +
> + use_plug = !is_flush_fua;

We can get rid of the use_plug variable. If is_flush_fua is set, then
we'll never get to the point where it is tested. The patched code looks
like this:

use_plug = !is_flush_fua;
...
if (unlikely(is_flush_fua)) {
blk_mq_bio_to_request(rq, bio);
blk_insert_flush(rq);
goto run_queue;
}

See? Here we just skip out to run the queue. The check comes later:

/*
* we do limited pluging. If bio can be merged, do merge. Otherwise the
* existing request in the plug list will be issued. So the plug list
* will have one request at most
*/
plug = current->plug;
if (use_plug && plug) {

So that if statement could be just 'if (plug)'.


> blk_queue_bounce(q, &bio);
>
> @@ -1240,6 +1283,10 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
> return;
> }
>
> + if (use_plug && !blk_queue_nomerges(q) &&
> + blk_attempt_plug_merge(q, bio, &request_count))
> + return;
> +

You've just added merging for flush/fua requets. Was that intentional?
We don't attempt them in the sq case, and we should probably be
consistent.


> rq = blk_mq_map_request(q, bio, &data);

> if (unlikely(!rq))
> return;
> @@ -1251,37 +1298,34 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
> }
>
> /*
> - * If the driver supports defer issued based on 'last', then
> - * queue it up like normal since we can potentially save some
> - * CPU this way.
> + * we do limited pluging. If bio can be merged, do merge. Otherwise the
> + * existing request in the plug list will be issued. So the plug list
> + * will have one request at most
> */
> - if (is_sync && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> - struct blk_mq_queue_data bd = {
> - .rq = rq,
> - .list = NULL,
> - .last = 1
> - };
> - int ret;
> + plug = current->plug;
> + if (use_plug && plug) {
> + struct request *old_rq = NULL;
>
> blk_mq_bio_to_request(rq, bio);
> + if (!list_empty(&plug->mq_list)) {
> + old_rq = list_first_entry(&plug->mq_list,
> + struct request, queuelist);
> + list_del_init(&old_rq->queuelist);
> + }
> + list_add_tail(&rq->queuelist, &plug->mq_list);
> + blk_mq_put_ctx(data.ctx);
> + if (!old_rq)
> + return;
> + if (!blk_mq_direct_issue_request(old_rq))
> + return;
> + blk_mq_insert_request(old_rq, false, true, true);
> + return;
> + }
>
> - /*
> - * For OK queue, we are done. For error, kill it. Any other
> - * error (busy), just add it to our list as we previously
> - * would have done
> - */
> - ret = q->mq_ops->queue_rq(data.hctx, &bd);
> - if (ret == BLK_MQ_RQ_QUEUE_OK)
> + if (is_sync && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> + blk_mq_bio_to_request(rq, bio);
> + if (!blk_mq_direct_issue_request(rq))

I think the BLK_MQ_F_DEFER_ISSUE suport isn't broken by this patch, but
it's not entirely clear to me. You've essentially split out that
particular code path into two different branches, where before it was
much easier to follow. Prior to your patch, if the flag was set, we
skip to the blk_mq_merge_queue_io call, bypassing the direct queue_rq.
After your patch, blk_mq_insert_request might be called, or
blk_mq_merge_queue_io might be. Like I said, I *think* the end result
is correct, but I'm not completely sure. Jens, do you want to check
that?

> @@ -1314,7 +1358,7 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
> * If we have multiple hardware queues, just go directly to
> * one of those for sync IO.
> */
> - use_plug = !is_flush_fua && !is_sync;
> + use_plug = !is_flush_fua;
>
> blk_queue_bounce(q, &bio);

For this part I'd rather use my patch. Would you mind rebasing your
patch on top of the sq plugging patch I submitted?

One final question, sort of related to this patch. At the end of
blk_mq_make_request, we have this:

run_queue:
blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);

So, we don't run the queue synchronously for flush/fua. Is that logic
right? Why don't we want to push that out immediately?

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/