Re: [RFC][PATCH] Make io_submit non-blocking

From: Rajat Sharma
Date: Tue Jul 24 2012 - 08:34:18 EST


Hi Ankit,

On Tue, Jul 24, 2012 at 5:11 PM, Ankit Jain <jankit@xxxxxxx> wrote:
>
>
> Currently, io_submit tries to execute the io requests on the
> same thread, which could block because of various reaons (eg.
> allocation of disk blocks). So, essentially, io_submit ends
> up being a blocking call.
>

Ideally filesystem should take care of it e.g. by deferring such time
consuming allocations and return -EIOCBQUEUED immediately. But have
you seen such cases?

> With this patch, io_submit prepares all the kiocbs and then
> adds (kicks) them to ctx->run_list (kicked) in one go and then
> schedules the workqueue. The actual operations are not executed
> on io_submit's process context, so it can return very quickly.
>

With lots of application threads firing continuous IOs, workqueue
threads might become bottleneck and you might have to eventually
develop a priority scheduling. This workqueue was originally designed
for IO retries which is an error path, now consumers of workqueue
might easily increase by 100x.

> This run_list is processed either on a workqueue or in response to
> an io_getevents call. This utilizes the existing retry infrastructure.
>
> It uses override_creds/revert_creds to use the submitting process'
> credentials when processing the iocb request from the workqueue. This
> is required for proper support of quota and reserved block access.
>
> Currently, we use block plugging in io_submit, since most of the IO
> was being done there itself. This patch moves it to aio_kick_handler
> and aio_run_all_iocbs, where the IO gets submitted.
>
> All the tests were run with ext4.
>
> I tested the patch with fio
> (fio rand-rw-disk.fio --max-jobs=2 --latency-log
> --bandwidth-log)
>
> **Unpatched**
> read : io=102120KB, bw=618740 B/s, iops=151 , runt=169006msec
> slat (usec): min=275 , max=87560 , avg=6571.88, stdev=2799.57
>
> write: io=102680KB, bw=622133 B/s, iops=151 , runt=169006msec
> slat (usec): min=2 , max=196 , avg=24.66, stdev=20.35
>
> **Patched**
> read : io=102864KB, bw=504885 B/s, iops=123 , runt=208627msec
> slat (usec): min=0 , max=120 , avg= 1.65, stdev= 3.46
>
> write: io=101936KB, bw=500330 B/s, iops=122 , runt=208627msec
> slat (usec): min=0 , max=131 , avg= 1.85, stdev= 3.27
>
> From above, it can be seen that submit latencies improve a lot with the
> patch. The full fio results for the "old"(unpatched) and "new"(patched)
> cases are attached. Results with both ramdisk (*rd*) and disk attached,
> and also the corresponding fio files.
>
> Some variations I tried:
>
> 1. I tried to submit one iocb at a time (lock/unlock ctx->lock), and
> that had good performance for a regular disk, but when I tested with a
> ramdisk (to simulate very fast disk), performance was extremely bad.
> Submitting all the iocbs from an io_submit in one go, restored the
> performance (latencies+bandwidth).
>
> 2. I was earlier trying to use queue_delayed_work with 0 timeout, but
> that worsened the submit latencies a bit but improved bandwidth.
>
> 3. Also, I tried not using aio_queue_work from io_submit call, and instead
> depending on an already scheduled one or the iocbs being run when
> io_getevents gets called. This seemed to give improved perfomance. But
> does this constitute as change of api semantics?
>

I once have observed latency issues with aio_queue_work with lesser
number of threads when I was trying to resubmit IOs on a ramdisk, as
this function introduces a mandatory delay if nobody is waiting on
this iocb. The latencies were high but with large number of threads,
effect was not prominent.

> Signed-off-by: Ankit Jain <jankit@xxxxxxx>
>
> --
> diff --git a/fs/aio.c b/fs/aio.c
> index 71f613c..79801096b 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -563,6 +563,11 @@ static int __aio_put_req(struct kioctx *ctx, struct
> kiocb *req)
> req->ki_cancel = NULL;
> req->ki_retry = NULL;
>
> + if (likely(req->submitter_cred)) {
> + put_cred(req->submitter_cred);
> + req->submitter_cred = NULL;
> + }
> +
> fput(req->ki_filp);
> req->ki_filp = NULL;
> really_put_req(ctx, req);
> @@ -659,6 +664,7 @@ static ssize_t aio_run_iocb(struct kiocb *iocb)
> struct kioctx *ctx = iocb->ki_ctx;
> ssize_t (*retry)(struct kiocb *);
> ssize_t ret;
> + const struct cred *old_cred = NULL;
>
> if (!(retry = iocb->ki_retry)) {
> printk("aio_run_iocb: iocb->ki_retry = NULL\n");
> @@ -703,12 +709,19 @@ static ssize_t aio_run_iocb(struct kiocb *iocb)
> goto out;
> }
>
> + if (iocb->submitter_cred)
> + /* setup creds */
> + old_cred = override_creds(iocb->submitter_cred);
> +
> /*
> * Now we are all set to call the retry method in async
> * context.
> */
> ret = retry(iocb);
>
> + if (old_cred)
> + revert_creds(old_cred);
> +
> if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) {
> /*
> * There's no easy way to restart the syscall since other
> AIO's
> @@ -804,10 +817,14 @@ static void aio_queue_work(struct kioctx * ctx)
> */
> static inline void aio_run_all_iocbs(struct kioctx *ctx)
> {
> + struct blk_plug plug;
> +
> + blk_start_plug(&plug);
> spin_lock_irq(&ctx->ctx_lock);
> while (__aio_run_iocbs(ctx))
> ;
> spin_unlock_irq(&ctx->ctx_lock);
> + blk_finish_plug(&plug);
> }
>
> /*
> @@ -825,13 +842,16 @@ static void aio_kick_handler(struct work_struct
> *work)
> mm_segment_t oldfs = get_fs();
> struct mm_struct *mm;
> int requeue;
> + struct blk_plug plug;
>
> set_fs(USER_DS);
> use_mm(ctx->mm);
> + blk_start_plug(&plug);
> spin_lock_irq(&ctx->ctx_lock);
> requeue =__aio_run_iocbs(ctx);
> mm = ctx->mm;
> spin_unlock_irq(&ctx->ctx_lock);
> + blk_finish_plug(&plug);
> unuse_mm(mm);
> set_fs(oldfs);
> /*
> @@ -1506,12 +1526,14 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb,
> bool compat)
>
> static int io_submit_one(struct kioctx *ctx, struct iocb __user
> *user_iocb,
> struct iocb *iocb, struct kiocb_batch *batch,
> - bool compat)
> + bool compat, struct kiocb **req_entry)
> {
> struct kiocb *req;
> struct file *file;
> ssize_t ret;
>
> + *req_entry = NULL;
> +
> /* enforce forwards compatibility on users */
> if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
> pr_debug("EINVAL: io_submit: reserve field set\n");
> @@ -1537,6 +1559,7 @@ static int io_submit_one(struct kioctx *ctx, struct
> iocb __user *user_iocb,
> fput(file);
> return -EAGAIN;
> }
> +
> req->ki_filp = file;
> if (iocb->aio_flags & IOCB_FLAG_RESFD) {
> /*
> @@ -1567,38 +1590,16 @@ static int io_submit_one(struct kioctx *ctx,
> struct iocb __user *user_iocb,
> req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
> req->ki_opcode = iocb->aio_lio_opcode;
>
> + req->submitter_cred = get_current_cred();
> +
> ret = aio_setup_iocb(req, compat);
>
> if (ret)
> goto out_put_req;
>
> - spin_lock_irq(&ctx->ctx_lock);
> - /*
> - * We could have raced with io_destroy() and are currently holding
> a
> - * reference to ctx which should be destroyed. We cannot submit IO
> - * since ctx gets freed as soon as io_submit() puts its reference.
> The
> - * check here is reliable: io_destroy() sets ctx->dead before
> waiting
> - * for outstanding IO and the barrier between these two is
> realized by
> - * unlock of mm->ioctx_lock and lock of ctx->ctx_lock.
> Analogously we
> - * increment ctx->reqs_active before checking for ctx->dead and
> the
> - * barrier is realized by unlock and lock of ctx->ctx_lock. Thus
> if we
> - * don't see ctx->dead set here, io_destroy() waits for our IO to
> - * finish.
> - */
> - if (ctx->dead) {
> - spin_unlock_irq(&ctx->ctx_lock);
> - ret = -EINVAL;
> - goto out_put_req;
> - }
> - aio_run_iocb(req);
> - if (!list_empty(&ctx->run_list)) {
> - /* drain the run list */
> - while (__aio_run_iocbs(ctx))
> - ;
> - }
> - spin_unlock_irq(&ctx->ctx_lock);
> -
> aio_put_req(req); /* drop extra ref to req */
> +
> + *req_entry = req;
> return 0;
>
> out_put_req:
> @@ -1613,8 +1614,10 @@ long do_io_submit(aio_context_t ctx_id, long nr,
> struct kioctx *ctx;
> long ret = 0;
> int i = 0;
> - struct blk_plug plug;
> struct kiocb_batch batch;
> + struct kiocb **req_arr = NULL;
> + int nr_submitted = 0;
> + int req_arr_cnt = 0;
>
> if (unlikely(nr < 0))
> return -EINVAL;
> @@ -1632,8 +1635,8 @@ long do_io_submit(aio_context_t ctx_id, long nr,
> }
>
> kiocb_batch_init(&batch, nr);
> -
> - blk_start_plug(&plug);
> + req_arr = kmalloc(sizeof(struct kiocb *) * nr, GFP_KERNEL);
> + memset(req_arr, 0, sizeof(req_arr));
>
> /*
> * AKPM: should this return a partial result if some of the IOs
> were
> @@ -1653,15 +1656,51 @@ long do_io_submit(aio_context_t ctx_id, long nr,
> break;
> }
>
> - ret = io_submit_one(ctx, user_iocb, &tmp, &batch, compat);
> + ret = io_submit_one(ctx, user_iocb, &tmp, &batch, compat,
> + &req_arr[i]);
> if (ret)
> break;
> + req_arr_cnt++;
> }
> - blk_finish_plug(&plug);
>
> + spin_lock_irq(&ctx->ctx_lock);
> + /*
> + * We could have raced with io_destroy() and are currently holding
> a
> + * reference to ctx which should be destroyed. We cannot submit IO
> + * since ctx gets freed as soon as io_submit() puts its reference.
> The
> + * check here is reliable: io_destroy() sets ctx->dead before
> waiting
> + * for outstanding IO and the barrier between these two is
> realized by
> + * unlock of mm->ioctx_lock and lock of ctx->ctx_lock.
> Analogously we
> + * increment ctx->reqs_active before checking for ctx->dead and
> the
> + * barrier is realized by unlock and lock of ctx->ctx_lock. Thus
> if we
> + * don't see ctx->dead set here, io_destroy() waits for our IO to
> + * finish.
> + */
> + if (ctx->dead) {
> + spin_unlock_irq(&ctx->ctx_lock);
> + for (i = 0; i < req_arr_cnt; i++)
> + /* drop i/o ref to the req */
> + __aio_put_req(ctx, req_arr[i]);
> +
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + for (i = 0; i < req_arr_cnt; i++) {
> + struct kiocb *req = req_arr[i];
> + if (likely(!kiocbTryKick(req)))
> + __queue_kicked_iocb(req);
> + nr_submitted++;
> + }
> + if (likely(nr_submitted > 0))
> + aio_queue_work(ctx);
> + spin_unlock_irq(&ctx->ctx_lock);
> +
> +out:
> kiocb_batch_free(ctx, &batch);
> + kfree(req_arr);
> put_ioctx(ctx);
> - return i ? i : ret;
> + return nr_submitted ? nr_submitted : ret;
> }
>
> /* sys_io_submit:
> diff --git a/include/linux/aio.h b/include/linux/aio.h
> index b1a520e..bcd6a5e 100644
> --- a/include/linux/aio.h
> +++ b/include/linux/aio.h
> @@ -124,6 +124,8 @@ struct kiocb {
> * this is the underlying eventfd context to deliver events to.
> */
> struct eventfd_ctx *ki_eventfd;
> +
> + const struct cred *submitter_cred;
> };
>
> #define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY)
>
> --
> Ankit Jain
> SUSE Labs

--
Rajat Sharma
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/