Re: single aio thread is migrated crazily by scheduler

From: Dave Chinner
Date: Sun Dec 01 2019 - 22:09:00 EST


On Thu, Nov 28, 2019 at 05:40:03PM +0800, Hillf Danton wrote:
> On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote:
> > Yeah, the fio task averages 13.4ms on any given CPU before being
> > switched to another CPU. Mind you, the stddev is 12ms, so the range
> > of how long it spends on any one CPU is pretty wide (330us to
> > 330ms).
> >
> Hey Dave
>
> > IOWs, this doesn't look like a workqueue problem at all - this looks
>
> Surprised to see you're so sure it has little to do with wq,

Because I understand how the workqueue is used here.

Essentially, the workqueue is not necessary for a -pure- overwrite
where no metadata updates or end-of-io filesystem work is required.

However, change the workload just slightly, such as allocating the
space, writing into preallocated space (unwritten extents), using
AIO writes to extend the file, using O_DSYNC, etc, and we *must*
use a workqueue as we have to take blocking locks and/or run
transactions.

These may still be very short (e.g. updating inode size) and in most
cases will not block. But if they do block and we haven't moved the
work out of the block layer completion context (i.e. softirq running
the block bh), then we risk deadlocking the code.

Not to mention that none of the filesystem inode locks are irq-safe.

IOWs, we can remove the workqueue for this -one specific instance-
but it does not remove the requirement for using a workqueue for all
the other types of write IO that pass through this code.
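
As a rough illustration of that rule, here is a toy userspace model.
It is not the iomap code: every model_* name below is made up for the
example and does not correspond to a real IOMAP_DIO_* flag. It only
encodes the cases listed above where completion may have to take
blocking locks and/or run a transaction.

```c
#include <stdbool.h>

/*
 * Toy model only: hypothetical flags, one per case named in the text
 * where end-of-io work may block. A pure overwrite sets none of them.
 */
enum model_write_needs {
	MODEL_ALLOCATES_SPACE  = 1u << 0,	/* space allocated by the write */
	MODEL_UNWRITTEN_EXTENT = 1u << 1,	/* write into preallocated space */
	MODEL_EXTENDS_FILE     = 1u << 2,	/* AIO write that moves inode size */
	MODEL_O_DSYNC          = 1u << 3,	/* O_DSYNC write */
};

/* Completion may block as soon as any end-of-io work is required. */
static bool model_completion_may_block(unsigned int needs)
{
	return needs != 0;
}

/*
 * Work that may block cannot run in the block layer completion
 * (softirq) context, so it has to be handed to a workqueue.
 */
static bool model_must_use_workqueue(unsigned int needs, bool in_softirq)
{
	return in_softirq && model_completion_may_block(needs);
}
```

In this model only the pure overwrite (needs == 0) can complete
directly in softirq context; every other combination has to be
deferred.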

> > like the scheduler is repeatedly making the wrong load balancing
> > decisions when mixing a very short runtime task (queued work) with a
> > long runtime task on the same CPU....
> >
> and it helps more to know what is driving lb to make decisions like
> this.

I know exactly what is driving it through both observation and
understanding of the code, and I've explained it elsewhere
in this thread.

> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct
>  			WRITE_ONCE(dio->submit.waiter, NULL);
>  			blk_wake_io_task(waiter);
>  		} else if (dio->flags & IOMAP_DIO_WRITE) {
> -			struct inode *inode = file_inode(dio->iocb->ki_filp);
> -
>  			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> -			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> +			schedule_work(&dio->aio.work);

This does nothing but change the workqueue from a per-sb wq to the
system wq. The work is still bound to the same CPU it is queued on,
so nothing will change.
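
To make that concrete, here is a toy userspace model (not kernel code;
the model_* names are invented for illustration). It assumes, as
described above, that both the per-sb wq and the system wq are per-cpu
workqueues, so queued work runs on whichever CPU queued it. The
"unbound" field is only there to show what a different kind of
workqueue would look like in the model.

```c
#include <stdbool.h>

/* Toy model of a workqueue: only the property that matters here. */
struct model_wq {
	const char *name;
	bool unbound;	/* false: work runs on the CPU that queued it */
};

/* Assumption for the model: both are per-cpu (bound) workqueues. */
static struct model_wq model_sb_dio_wq = { "per-sb dio wq", false };
static struct model_wq model_system_wq = { "system wq", false };

/*
 * Returns the CPU the work item will run on, or -1 if the workqueue
 * leaves the choice of CPU open.
 */
static int model_queue_work(const struct model_wq *wq, int queueing_cpu)
{
	if (wq->unbound)
		return -1;
	return queueing_cpu;
}

/* Models schedule_work(): the same queueing, onto the system wq. */
static int model_schedule_work(int queueing_cpu)
{
	return model_queue_work(&model_system_wq, queueing_cpu);
}
```

For any queueing CPU, model_queue_work(&model_sb_dio_wq, cpu) and
model_schedule_work(cpu) return the same CPU, which is why swapping one
for the other changes nothing about where the work runs.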

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx