Re: loop block-mq conversion scalability issues

From: Justin M. Forbes
Date: Mon Apr 27 2015 - 09:48:48 EST


On Sun, 2015-04-26 at 23:27 +0800, Ming Lei wrote:
> Hi Justin,
>
> On Fri, 24 Apr 2015 16:46:02 -0500
> "Justin M. Forbes" <jforbes@xxxxxxxxxxxxxxxxx> wrote:
>
> > On Fri, 2015-04-24 at 10:59 +0800, Ming Lei wrote:
> > > Hi Justin,
> > >
> > > Thanks for the report.
> > >
> > > On Thu, 23 Apr 2015 16:04:10 -0500
> > > "Justin M. Forbes" <jforbes@xxxxxxxxxx> wrote:
> > >
> > > > The block-mq conversion for loop in 4.0 kernels is showing us an
> > > > interesting scalability problem with live CDs (ro, squashfs). It was
> > > > noticed when testing the Fedora beta that the more CPUs a liveCD image
> > > > was given, the slower it would boot. A 4 core qemu instance or bare
> > > > metal instance took more than twice as long to boot compared to a single
> > > > CPU instance. After investigating, this came directly to the block-mq
> > > > conversion, reverting these 4 patches will return performance. More
> > > > details are available at
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1210857
> > > > I don't think that reverting the patches is the ideal solution so I am
> > > > looking for other options. Since you know this code a bit better than I
> > > > do I thought I would run it by you while I am looking as well.
> > >
> > > I can understand the issue because the default @max_active for
> > > alloc_workqueue() is quite big (512), which may cause too many
> > > context switches, and then loop I/O performance decreases.
> > >
> > > Actually I have written a kernel dio/aio based patch for decreasing
> > > both CPU and memory utilization without sacrificing I/O performance,
> > > and I will try to improve and push the patch during this cycle in the
> > > hope that it can be merged (the kernel/aio.c change is dropped, and
> > > only an fs change is needed in fs/direct-io.c).
> > >
> > > But the following change should help for your case, could you test it?
> > >
> > > ---
> > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > > index c6b3726..b1cb41d 100644
> > > --- a/drivers/block/loop.c
> > > +++ b/drivers/block/loop.c
> > > @@ -1831,7 +1831,7 @@ static int __init loop_init(void)
> > >  	}
> > >
> > >  	loop_wq = alloc_workqueue("kloopd",
> > > -			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
> > > +			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 32);
> > >  	if (!loop_wq) {
> > >  		err = -ENOMEM;
> > >  		goto misc_out;
> > >
> > Patch tested, but it made things worse (I gave up after 5 minutes and
> > boot still seemed hung). I also tried values of 1, 16, 64, and 128.
> > Everything below 128 was much worse than the current situation. Setting
> > it at 128 seemed about the same as booting without the patch. I can do
> > some more testing over the weekend, but I don't think this is the
> > correct solution.
>
> To describe the problem more easily, here is the Fedora live CD file structure first:
>
> Fedora-Live-Workstation-x86_64-22_Beta-TC8.iso
>   => LiveOS/
>        squashfs.img
>          => LiveOS/
>               ext3fs.img
>
> It looks like at least two reasons are related to the problem:
>
> - unlike other filesystems (such as ext4), squashfs is a bit special: I
> observed that increasing the number of I/O jobs accessing a file in squashfs
> can't improve I/O performance at all, whereas it can for ext4
>
> - nested loop: both squashfs.img and ext3fs.img are mounted as loop block
> devices
>
> One key idea in commit b5dd2f60 (block: loop: improve performance via blk-mq)
> is to submit I/O concurrently from more than one context (worker), in a POSIX
> AIO style. Unfortunately this approach can't improve I/O performance for
> squashfs, it comes with the extra cost of kworker threads, and the nested loop
> makes it worse. Meanwhile, during booting there are lots of concurrent tasks
> requiring CPU, so the high-priority kworker threads for loop can affect other
> boot tasks, and boot time increases.
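
For anyone less familiar with the submission model described above, here is a
rough userspace sketch of the POSIX-AIO-style pattern: several reads are queued
up front and completed out of line, instead of being issued one at a time from
a single context. This is only an illustration, not the loop driver code; the
file path, request count, and request size are arbitrary assumptions, and it
links against librt (gcc -lrt).

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NR_REQS		4
#define REQ_SIZE	4096

int main(int argc, char **argv)
{
	/* Arbitrary default path, purely for demonstration. */
	const char *path = argc > 1 ? argv[1] : "/etc/os-release";
	static char bufs[NR_REQS][REQ_SIZE];
	struct aiocb cbs[NR_REQS];
	int fd, i;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Submit every read before waiting on any of them. */
	memset(cbs, 0, sizeof(cbs));
	for (i = 0; i < NR_REQS; i++) {
		cbs[i].aio_fildes = fd;
		cbs[i].aio_buf = bufs[i];
		cbs[i].aio_nbytes = REQ_SIZE;
		cbs[i].aio_offset = (off_t)i * REQ_SIZE;
		if (aio_read(&cbs[i]) < 0) {
			perror("aio_read");
			return 1;
		}
	}

	/* Reap completions; glibc services them from its own thread pool. */
	for (i = 0; i < NR_REQS; i++) {
		const struct aiocb *list[1] = { &cbs[i] };

		while (aio_error(&cbs[i]) == EINPROGRESS)
			aio_suspend(list, 1, NULL);
		printf("request %d: %zd bytes\n", i, aio_return(&cbs[i]));
	}

	close(fd);
	return 0;
}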
>
> I think the problem may be improved by removing the nested loop, for example
> by extracting the files in ext3fs.img into squashfs.img.
>
> > I would be interested in testing your dio/aio patches as well though.
>
> squashfs doesn't support dio, so the dio/aio patch can't help much here, but
> the motivation for introducing dio/aio is really to avoid double caching and
> to decrease CPU utilization [1].
>
> [1], http://marc.info/?l=linux-kernel&m=142116397525668&w=2
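
On the double-cache point, here is a minimal userspace sketch of what direct
I/O buys: with O_DIRECT the read bypasses the page cache of the backing file,
so data is not cached once for the loop device and again for the file beneath
it. This is only an illustration, not the proposed kernel patch; the path and
the 4096-byte alignment and size are arbitrary assumptions.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* Hypothetical backing file; any regular file on a local fs works. */
	const char *path = argc > 1 ? argv[1] : "/tmp/backing.img";
	void *buf;
	ssize_t n;
	int fd;

	/* O_DIRECT needs block-aligned buffers; 4096 covers common cases. */
	if (posix_memalign(&buf, 4096, 4096)) {
		perror("posix_memalign");
		return 1;
	}

	fd = open(path, O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open(O_DIRECT)");
		return 1;
	}

	/* This read goes to the device without populating the page cache. */
	n = read(fd, buf, 4096);
	if (n < 0)
		perror("read");
	else
		printf("read %zd bytes directly, bypassing the page cache\n", n);

	close(fd);
	free(buf);
	return 0;
}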
>
> The following patch may help the situation, but for this case I am wondering
> whether it can compete with the previous loop implementation.
> ---
> From 0af95571a2a066b4f3bacaac2c75b39e3c701c6e Mon Sep 17 00:00:00 2001
> From: Ming Lei <ming.lei@xxxxxxxxxxxxx>
> Date: Sun, 26 Apr 2015 17:53:56 +0800
> Subject: [PATCH] block: loop: avoiding too many pending per work I/O
>
> If there is too much pending per-work I/O, too many
> high priority worker threads can be generated, so that
> system performance can be affected.
>
> This patch limits the max pending per-work I/O to 16,
> and degrades to single queue mode when the max number
> is reached.
>
> This patch fixes a Fedora 22 live booting performance
> regression seen when booting from squashfs over dm
> on top of loop.
>
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxxxxx>
> ---
> drivers/block/loop.c | 21 ++++++++++++++++++---
> drivers/block/loop.h | 2 ++
> 2 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index c6b3726..55bd04f 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1448,13 +1448,24 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		const struct blk_mq_queue_data *bd)
>  {
>  	struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
> +	struct loop_device *lo = cmd->rq->q->queuedata;
> +	bool single_queue = !!(cmd->rq->cmd_flags & REQ_WRITE);
> +
> +	/*
> +	 * Degrade to single queue mode if the pending per work
> +	 * I/O number reaches 16, otherwise too many high priority
> +	 * worker threads may affect system performance as reported
> +	 * in Fedora live booting from squashfs over loop.
> +	 */
> +	if (atomic_read(&lo->pending_per_work_io) >= 16)
> +		single_queue = true;
>
>  	blk_mq_start_request(bd->rq);
>
> -	if (cmd->rq->cmd_flags & REQ_WRITE) {
> -		struct loop_device *lo = cmd->rq->q->queuedata;
> +	if (single_queue) {
>  		bool need_sched = true;
>
> +		cmd->per_work_io = false;
>  		spin_lock_irq(&lo->lo_lock);
>  		if (lo->write_started)
>  			need_sched = false;
> @@ -1466,6 +1477,8 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		if (need_sched)
>  			queue_work(loop_wq, &lo->write_work);
>  	} else {
> +		cmd->per_work_io = true;
> +		atomic_inc(&lo->pending_per_work_io);
>  		queue_work(loop_wq, &cmd->read_work);
>  	}
>
> @@ -1490,6 +1503,8 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
>  	if (ret)
>  		cmd->rq->errors = -EIO;
>  	blk_mq_complete_request(cmd->rq);
> +	if (cmd->per_work_io)
> +		atomic_dec(&lo->pending_per_work_io);
>  }
>
>  static void loop_queue_write_work(struct work_struct *work)
> @@ -1831,7 +1846,7 @@ static int __init loop_init(void)
>  	}
>
>  	loop_wq = alloc_workqueue("kloopd",
> -			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
> +			WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
>  	if (!loop_wq) {
>  		err = -ENOMEM;
>  		goto misc_out;
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index ffb6dd6..06d8f1a 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -57,6 +57,7 @@ struct loop_device {
>  	struct list_head	write_cmd_head;
>  	struct work_struct	write_work;
>  	bool			write_started;
> +	atomic_t		pending_per_work_io;
>  	int			lo_state;
>  	struct mutex		lo_ctl_mutex;
>
> @@ -68,6 +69,7 @@ struct loop_device {
>  struct loop_cmd {
>  	struct work_struct read_work;
>  	struct request *rq;
> +	bool per_work_io;
>  	struct list_head list;
>  };
>

This patch tests well. It gives comparable boot times to the 3.19 loop
implementation and does not seem to have problems scaling with SMP. I
would love to see this make 4.1 and 4.0.x stable as well.

Thanks,
Justin
