[RFC] blk-mq and I/O scheduling

From: Andreas Herrmann
Date: Thu Nov 19 2015 - 07:03:14 EST


Hi,

I've looked into blk-mq and possible support for I/O scheduling.

The reason for this is to minimize performance degradation with
rotational devices when scsi_mod.use_blk_mq=1 is switched on.

I think the degradation is well reflected in fio measurements: with an
increasing number of jobs you'll see a significant performance drop for
sequential reads and writes with blk-mq compared to CFQ. blk-mq ensures
that requests from different processes (CPUs) are "perfectly shuffled"
in a hardware queue. This is not a problem for the non-rotational
devices that blk-mq is aimed at, but it is not so nice for rotational
disks.

(i) I've done some tests with patch c2ed2f2dcf92 (blk-mq: first cut
deadline scheduling) from the mq-deadline branch of the linux-block
repository. I've not seen a significant performance impact when
enabling it, neither for non-rotational nor for rotational
disks.

(ii) I've played with code to enable sorting/merging of requests. I
did this in flush_busy_ctxs (roughly along the lines of the sketch
after this list). This didn't have a performance impact either. On
closer look this was due to the high frequency of calls to
__blk_mq_run_hw_queue: there was almost nothing to sort (too few
requests). I guess that's also the reason why (i) didn't have much
impact.

(iii) With CFQ I've observed performance patterns similar to blk-mq's
when slice_idle was set to 0.

(iv) I thought about introducing a per-software-queue time slice
during which blk-mq will service only one software queue (one
CPU) and not flush all software queues. This could help to
enqueue multiple requests belonging to the same process (as long
as it runs on the same CPU) into a hardware queue. A minimal
patch to implement this is attached below.
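
As mentioned in (ii), here is roughly the kind of sorting I
experimented with -- a sketch only, not the exact code I used;
rq_pos_cmp and blk_mq_sort_dispatch_list are made-up names. The idea
is to order the requests spliced off the software queues by start
sector, e.g. right after flush_busy_ctxs() in __blk_mq_run_hw_queue():

#include <linux/blkdev.h>
#include <linux/list_sort.h>

/* order two requests by start sector (ascending) */
static int rq_pos_cmp(void *priv, struct list_head *a, struct list_head *b)
{
	struct request *rqa = list_entry(a, struct request, queuelist);
	struct request *rqb = list_entry(b, struct request, queuelist);

	if (blk_rq_pos(rqa) < blk_rq_pos(rqb))
		return -1;
	return blk_rq_pos(rqa) > blk_rq_pos(rqb);
}

/* sort the dispatch list collected from the software queues */
static void blk_mq_sort_dispatch_list(struct list_head *rq_list)
{
	list_sort(NULL, rq_list, rq_pos_cmp);
}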

Approach (iv) helped to improve performance for sequential reads and
writes, but it's not on par with CFQ. Increasing the time slice is
suboptimal (as shown by the 2ms results below). It might be possible
to get better performance by further reducing the initial time slice
and adapting it up to a maximum value as long as requests are
repeatedly pending for a CPU; a sketch of this follows.
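
This adaptive scheme could look roughly as follows (names and
constants are hypothetical; this is not part of the patch below):
double the slice up to a maximum as long as the serviced software
queue still has requests pending when its slice expires, otherwise
fall back to the minimum.

#include <linux/kernel.h>

#define BLK_MQ_TSLICE_MIN_MS	1
#define BLK_MQ_TSLICE_MAX_MS	8

/* pick the next slice length based on whether the CPU kept its queue busy */
static unsigned int blk_mq_adapt_tslice(unsigned int cur_ms, bool still_pending)
{
	if (still_pending)
		return min_t(unsigned int, cur_ms * 2, BLK_MQ_TSLICE_MAX_MS);

	return BLK_MQ_TSLICE_MIN_MS;
}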

After these observations, and assuming that non-rotational devices are
most likely fine using blk-mq without I/O scheduling support, I wonder
whether

- it's really a good idea to re-implement scheduling support for
blk-mq that eventually behaves like CFQ for rotational devices.

- it's technically possible to support both blk-mq and CFQ for different
devices on the same host adapter. This would allow using "good old"
code for "good old" rotational devices. (But this might not be an
option if the long-term goal is to get rid of the non-blk-mq code path --
not sure what the plans are.)

What do you think about this?


Thanks,

Andreas

---

Here are some results, obtained with the patch below on top of Linux
4.4.0-rc1, to illustrate (iv).
- Intel Core i7-3770S CPU @ 3.10GHz system, 4 cores, 8 threads/CPUs
- fio version as of 2.2.9-37-g0e1c4
- results were gathered by iterating over the rw and numjobs parameters, e.g.:

fio --directory=$1 --rw=$j --name=fio-scaling --size=5G --group_reporting \
--ioengine=libaio --direct=1 --iodepth=1 --runtime=$2 --numjobs=$i

- rotational device was a 1500GB Samsung HD155UI

ata3.00: ATA-8: SAMSUNG HD155UI, 1AQ10001, max UDMA/133
scsi 2:0:0:0: Direct-Access ATA SAMSUNG HD155UI 0001 PQ: 0 ANSI: 5

(1) cfq, slice_idle=8 (default)
(2) cfq, slice_idle=0
(3) blk-mq, use_time_slice=0 (behaviour matches unmodified mainline)
(4) blk-mq, use_time_slice=1 (1ms)
(5) blk-mq, use_time_slice=2 (2ms)

-------------------------------- -------------------------------
numjobs  cfq(1)  cfq(2)  blk-mq(3)  blk-mq(4)  blk-mq(5)    numjobs  cfq(1)  cfq(2)  blk-mq(3)  blk-mq(4)  blk-mq(5)
                                    (1ms)      (2ms)                                            (1ms)      (2ms)
-------------------------------- -------------------------------
read iops randread iops
-------------------------------- -------------------------------
1 17594 17344 19892 19452 19505 1 102 103 100 102 102
2 16939 8394 8201 13187 12141 2 97 97 97 97 97
3 16972 7410 8759 12243 11311 3 103 102 101 102 103
4 15125 8692 9418 11585 11388 4 109 108 109 109 108
5 14721 6159 6050 10430 10539 5 119 119 118 118 118
6 13833 5997 6087 8775 8812 6 125 125 126 125 125
7 13247 6022 6187 8493 9354 7 128 132 130 130 129
8 12979 5970 6169 8251 7688 8 137 136 135 137 136
-------------------------------- -------------------------------
write iops randwrite iops
-------------------------------- -------------------------------
1 15841 15988 17239 17306 17274 1 182 183 184 181 179
2 16679 9860 10203 13512 11073 2 169 174 172 170 170
3 16579 9684 9928 11430 10737 3 162 163 160 161 160
4 16795 10095 10078 10810 10331 4 161 162 162 163 163
5 16586 9980 10051 9802 9567 5 162 160 162 162 161
6 16621 9895 9857 10213 9920 6 161 163 159 159 160
7 16387 9780 9783 9812 9484 7 165 164 164 159 160
8 16513 9652 9658 9778 9572 8 165 165 165 164 165
-------------------------------- -------------------------------
rw (read) iops randrw (read) iops
-------------------------------- -------------------------------
1 4061 3840 4301 4142 4172 1 65 65 65 64 64
2 4028 4569 4807 3815 3792 2 63 63 63 64 63
3 3615 3128 3502 3216 3202 3 64 63 63 64 63
4 3611 2889 3207 2821 2862 4 66 67 66 66 67
5 3419 3026 3033 2819 2871 5 71 70 71 70 71
6 3450 2981 2969 2619 2656 6 74 74 73 73 73
7 3448 2719 2709 2565 2594 7 76 75 74 75 76
8 3285 2283 2184 2233 2408 8 76 77 78 77 77
-------------------------------- -------------------------------
rw (write) iops randrw (write) iops
-------------------------------- -------------------------------
1 4057 3841 4295 4138 4168 1 63 63 62 62 62
2 4014 4562 4802 3798 3773 2 60 60 60 60 60
3 3615 3132 3503 3216 3201 3 55 55 55 55 54
4 3611 2885 3205 2825 2865 4 59 59 58 58 59
5 3418 3044 3049 2823 2877 5 61 61 62 61 62
6 3474 2984 2970 2615 2653 6 64 64 63 63 63
7 3451 2711 2702 2565 2593 7 65 65 65 65 65
8 3280 2285 2188 2234 2403 8 68 68 68 67 67
-------------------------------- -------------------------------
---
blk-mq: Introduce per sw queue time-slice

Signed-off-by: Andreas Herrmann <aherrmann@xxxxxxxx>
---
block/blk-mq-sysfs.c | 27 +++++++++
block/blk-mq.c | 162 ++++++++++++++++++++++++++++++++++++++++---------
include/linux/blk-mq.h | 6 ++
3 files changed, 165 insertions(+), 30 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 1cf1878..400bf93 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -247,6 +247,26 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
return ret;
}

+static ssize_t blk_mq_hw_sysfs_tslice_show(struct blk_mq_hw_ctx *hctx,
+ char *page)
+{
+ return sprintf(page, "%u\n", hctx->use_time_slice);
+}
+
+static ssize_t blk_mq_hw_sysfs_tslice_store(struct blk_mq_hw_ctx *hctx,
+ const char *page, size_t length)
+{
+ unsigned long long store;
+ int err;
+
+ err = kstrtoull(page, 10, &store);
+ if (err)
+ return -EINVAL;
+
+ hctx->use_time_slice = (unsigned) store;
+ return length;
+}
+
static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
.attr = {.name = "dispatched", .mode = S_IRUGO },
.show = blk_mq_sysfs_dispatched_show,
@@ -305,6 +325,12 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
.show = blk_mq_hw_sysfs_poll_show,
};

+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_tslice = {
+ .attr = {.name = "use_time_slice", .mode = S_IRUGO | S_IWUSR },
+ .show = blk_mq_hw_sysfs_tslice_show,
+ .store = blk_mq_hw_sysfs_tslice_store,
+};
+
static struct attribute *default_hw_ctx_attrs[] = {
&blk_mq_hw_sysfs_queued.attr,
&blk_mq_hw_sysfs_run.attr,
@@ -314,6 +340,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
&blk_mq_hw_sysfs_cpus.attr,
&blk_mq_hw_sysfs_active.attr,
&blk_mq_hw_sysfs_poll.attr,
+ &blk_mq_hw_sysfs_tslice.attr,
NULL,
};

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ae09de..b71d864 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -68,6 +68,8 @@ static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx,

if (!test_bit(CTX_TO_BIT(hctx, ctx), &bm->word))
set_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
+
+ cpumask_set_cpu(ctx->cpu, hctx->cpu_pending_mask);
}

static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
@@ -76,6 +78,7 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
struct blk_align_bitmap *bm = get_bm(hctx, ctx);

clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
+ cpumask_clear_cpu(ctx->cpu, hctx->cpu_pending_mask);
}

void blk_mq_freeze_queue_start(struct request_queue *q)
@@ -685,10 +688,26 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
* Process software queues that have been marked busy, splicing them
* to the for-dispatch
*/
-static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
+static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx,
+ struct list_head *list, int cpu)
{
+ struct request_queue *q = hctx->queue;
struct blk_mq_ctx *ctx;
int i;
+ const int use_time_slice = hctx->use_time_slice;
+
+ if (use_time_slice && cpu != -1) {
+ if (cpu != -1) {
+ ctx = __blk_mq_get_ctx(q, cpu);
+ spin_lock(&ctx->lock);
+ list_splice_tail_init(&ctx->rq_list, list);
+ spin_lock(&hctx->lock);
+ blk_mq_hctx_clear_pending(hctx, ctx);
+ spin_unlock(&hctx->lock);
+ spin_unlock(&ctx->lock);
+ }
+ return;
+ }

for (i = 0; i < hctx->ctx_map.size; i++) {
struct blk_align_bitmap *bm = &hctx->ctx_map.map[i];
@@ -705,9 +724,11 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
break;

ctx = hctx->ctxs[bit + off];
- clear_bit(bit, &bm->word);
spin_lock(&ctx->lock);
list_splice_tail_init(&ctx->rq_list, list);
+ spin_lock(&hctx->lock);
+ blk_mq_hctx_clear_pending(hctx, ctx);
+ spin_unlock(&hctx->lock);
spin_unlock(&ctx->lock);

bit++;
@@ -716,6 +737,71 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
}

/*
+ * It'd be great if the workqueue API had a way to pass
+ * in a mask and had some smarts for more clever placement.
+ * For now we just round-robin here, switching for every
+ * BLK_MQ_CPU_WORK_BATCH queued items.
+ */
+static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
+{
+ if (hctx->queue->nr_hw_queues == 1)
+ return WORK_CPU_UNBOUND;
+
+ if (--hctx->next_cpu_batch <= 0) {
+ int cpu = hctx->next_cpu, next_cpu;
+
+ next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
+ if (next_cpu >= nr_cpu_ids)
+ next_cpu = cpumask_first(hctx->cpumask);
+
+ hctx->next_cpu = next_cpu;
+ hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
+
+ return cpu;
+ }
+
+ return hctx->next_cpu;
+}
+
+static int blk_mq_sched_slice_expired(struct blk_mq_hw_ctx *hctx)
+{
+ /*
+ * Use time slice before requests of other CPUs are serviced.
+ * Maybe change to serve n requests certain number of sectors
+ * before switching to other CPU.
+ */
+ if (time_after_eq(jiffies, hctx->sched_slice_exp))
+ return 1;
+
+ return 0;
+}
+
+static int blk_mq_sched_next_cpu(struct blk_mq_hw_ctx *hctx)
+{
+ int c = hctx->sched_cpu;
+
+ if (c != -1)
+ cpumask_clear_cpu(c, hctx->cpu_service_mask);
+
+ if (cpumask_and(hctx->cpu_next_mask, hctx->cpu_pending_mask,
+ hctx->cpu_service_mask)) {
+ c = cpumask_first(hctx->cpu_next_mask);
+ } else {
+ /* we are idle, epoch ended, reset */
+ hctx->sched_slice_exp = 0;
+ cpumask_setall(hctx->cpu_service_mask);
+ c = cpumask_first(hctx->cpu_pending_mask);
+ }
+
+ if (c >= nr_cpu_ids)
+ return -1;
+
+ hctx->sched_slice_exp = jiffies +
+ msecs_to_jiffies(hctx->use_time_slice);
+ return c;
+}
+
+/*
* Run this hardware queue, pulling any software queues mapped to it in.
* Note that this function currently has various problems around ordering
* of IO. In particular, we'd like FIFO behaviour on handling existing
@@ -729,6 +815,8 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
LIST_HEAD(driver_list);
struct list_head *dptr;
int queued;
+ const int use_time_slice = hctx->use_time_slice;
+ int cpu = -1;

WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));

@@ -737,10 +825,19 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)

hctx->run++;

+ if (use_time_slice) {
+ spin_lock(&hctx->lock);
+ if (blk_mq_sched_slice_expired(hctx))
+ hctx->sched_cpu = blk_mq_sched_next_cpu(hctx);
+ cpu = hctx->sched_cpu;
+ spin_unlock(&hctx->lock);
+ /* don't leave, but check dispatch queue if cpu == -1 */
+ }
+
/*
* Touch any software queue that has pending entries.
*/
- flush_busy_ctxs(hctx, &rq_list);
+ flush_busy_ctxs(hctx, &rq_list, cpu);

/*
* If we have previous entries on our dispatch list, grab them
@@ -825,36 +922,15 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
* blk_mq_run_hw_queue() already checks the STOPPED bit
**/
blk_mq_run_hw_queue(hctx, true);
+ } else if (use_time_slice && !cpumask_empty(hctx->cpu_pending_mask)) {
+ long t = hctx->sched_slice_exp - jiffies;
+ t = (t < 0) ? 0 : t;
+ kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+ &hctx->run_work,
+ (unsigned int)t);
}
}

-/*
- * It'd be great if the workqueue API had a way to pass
- * in a mask and had some smarts for more clever placement.
- * For now we just round-robin here, switching for every
- * BLK_MQ_CPU_WORK_BATCH queued items.
- */
-static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
-{
- if (hctx->queue->nr_hw_queues == 1)
- return WORK_CPU_UNBOUND;
-
- if (--hctx->next_cpu_batch <= 0) {
- int cpu = hctx->next_cpu, next_cpu;
-
- next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
- if (next_cpu >= nr_cpu_ids)
- next_cpu = cpumask_first(hctx->cpumask);
-
- hctx->next_cpu = next_cpu;
- hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
-
- return cpu;
- }
-
- return hctx->next_cpu;
-}
-
void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
{
if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
@@ -991,7 +1067,10 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
struct blk_mq_ctx *ctx = rq->mq_ctx;

__blk_mq_insert_req_list(hctx, ctx, rq, at_head);
+ spin_lock(&hctx->lock);
blk_mq_hctx_mark_pending(hctx, ctx);
+ spin_unlock(&hctx->lock);
+
}

void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
@@ -1580,7 +1659,9 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
spin_lock(&ctx->lock);
if (!list_empty(&ctx->rq_list)) {
list_splice_init(&ctx->rq_list, &tmp);
+ spin_lock(&hctx->lock);
blk_mq_hctx_clear_pending(hctx, ctx);
+ spin_unlock(&hctx->lock);
}
spin_unlock(&ctx->lock);

@@ -1599,7 +1680,9 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
}

hctx = q->mq_ops->map_queue(q, ctx->cpu);
+ spin_lock(&hctx->lock);
blk_mq_hctx_mark_pending(hctx, ctx);
+ spin_unlock(&hctx->lock);

spin_unlock(&ctx->lock);

@@ -1688,6 +1771,10 @@ static int blk_mq_init_hctx(struct request_queue *q,
hctx->queue_num = hctx_idx;
hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;

+ hctx->use_time_slice = 0;
+ hctx->sched_slice_exp = 0;
+ hctx->sched_cpu = -1;
+
blk_mq_init_cpu_notifier(&hctx->cpu_notifier,
blk_mq_hctx_notify, hctx);
blk_mq_register_cpu_notifier(&hctx->cpu_notifier);
@@ -2010,6 +2097,18 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
node))
goto err_hctxs;

+ if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_pending_mask,
+ GFP_KERNEL, node))
+ goto err_hctxs;
+
+ if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_service_mask,
+ GFP_KERNEL, node))
+ goto err_hctxs;
+
+ if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_next_mask,
+ GFP_KERNEL, node))
+ goto err_hctxs;
+
atomic_set(&hctxs[i]->nr_active, 0);
hctxs[i]->numa_node = node;
hctxs[i]->queue_num = i;
@@ -2073,6 +2172,9 @@ err_hctxs:
if (!hctxs[i])
break;
free_cpumask_var(hctxs[i]->cpumask);
+ free_cpumask_var(hctxs[i]->cpu_pending_mask);
+ free_cpumask_var(hctxs[i]->cpu_service_mask);
+ free_cpumask_var(hctxs[i]->cpu_next_mask);
kfree(hctxs[i]);
}
err_map:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index daf17d7..6e192b5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -28,6 +28,9 @@ struct blk_mq_hw_ctx {
struct delayed_work run_work;
struct delayed_work delay_work;
cpumask_var_t cpumask;
+ cpumask_var_t cpu_service_mask;
+ cpumask_var_t cpu_pending_mask;
+ cpumask_var_t cpu_next_mask;
int next_cpu;
int next_cpu_batch;

@@ -57,6 +60,9 @@ struct blk_mq_hw_ctx {

atomic_t nr_active;

+ int use_time_slice;
+ unsigned long sched_slice_exp;
+ int sched_cpu;
struct blk_mq_cpu_notifier cpu_notifier;
struct kobject kobj;

--
1.9.1
