Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

From: Ming Lei
Date: Tue Dec 24 2019 - 04:35:48 EST


On Tue, Dec 24, 2019 at 09:51:16AM +0100, Andrea Vai wrote:
> Il giorno mar, 24/12/2019 alle 09.27 +0800, Ming Lei ha scritto:
> > Hi Ted,
> >
> > On Mon, Dec 23, 2019 at 02:53:01PM -0500, Theodore Y. Ts'o wrote:
> > > On Mon, Dec 23, 2019 at 07:45:57PM +0100, Andrea Vai wrote:
> > > > basically, it's:
> > > >
> > > > mount UUID=$uuid /mnt/pendrive
> > > > SECONDS=0
> > > > cp $testfile /mnt/pendrive
> > > > umount /mnt/pendrive
> > > > tempo=$SECONDS
> > > >
> > > > and it copies one file only. Anyway, you can find the whole
> > script
> > > > attached.
> > >
> > > OK, so whether we are doing the writeback at the end of cp, or
> > when
> > > you do the umount, it's probably not going to make any
> > difference. We
> > > can get rid of the stack trace in question by changing the script
> > to
> > > be basically:
> > >
> > > mount UUID=$uuid /mnt/pendrive
> > > SECONDS=0
> > > rm -f /mnt/pendrive/$testfile
> > > cp $testfile /mnt/pendrive
> > > umount /mnt/pendrive
> > > tempo=$SECONDS
> > >
> > > I predict if you do that, you'll see that all of the time is spent
> > in
> > > the umount, when we are trying to write back the file.
> > >
> > > I really don't think then this is a file system problem at
> > all. It's
> > > just that USB I/O is slow, for whatever reason. We'll see a stack
> > > trace in the writeback code waiting for the I/O to be completed,
> > but
> > > that doesn't mean that the root cause is in the writeback code or
> > in
> > > the file system which is triggering the writeback.
> >
> > Wrt. the slow write on this usb storage, it is caused by two
> > writeback
> > path, one is the writeback wq, another is from ext4_release_file()
> > which
> > is triggered from exit_to_usermode_loop().
> >
> > When the two write path is run concurrently, the sequential write
> > order
> > is broken, then write performance drops much on this particular usb
> > storage.
> >
> > The ext4_release_file() should be run from read() or write() syscall
> > if
> > Fedora 30's 'cp' is implemented correctly. IMO, it isn't expected
> > behavior
> > for ext4_release_file() to be run thousands of times when just
> > running 'cp' once, see comment of ext4_release_file():
> >
> > /*
> > * Called when an inode is released. Note that this is
> > different
> > * from ext4_file_open: open gets called at every open, but
> > release
> > * gets called only when /all/ the files are closed.
> > */
> > static int ext4_release_file(struct inode *inode, struct file
> > *filp)
> >
> > >
> > > I suspect the next step is use a blktrace, to see what kind of I/O
> > is
> > > being sent to the USB drive, and how long it takes for the I/O to
> > > complete. You might also try to capture the output of "iostat -x
> > 1"
> > > while the script is running, and see what the difference might be
> > > between a kernel version that has the problem and one that
> > doesn't,
> > > and see if that gives us a clue.
> >
> > That isn't necessary, given we have concluded that the bad write
> > performance is caused by broken write order.
> >
> > >
> > > > > And then send me
> > > > btw, please tell me if "me" means only you or I cc: all the
> > > > recipients, as usual
> > >
> > > Well, I don't think we know what the root cause is. Ming is
> > focusing
> > > on that stack trace, but I think it's a red herring..... And if
> > it's
> > > not a file system problem, then other people will be best suited
> > to
> > > debug the issue.
> >
> > So far, the reason points to the extra writeback path from
> > exit_to_usermode_loop().
> > If it is not from close() syscall, the issue should be related with
> > file reference
> > count. If it is from close() syscall, the issue might be in 'cp''s
> > implementation.
> >
> > Andrea, please collect the following log or the strace log requested
> > by Ted, then
> > we can confirm if the extra writeback is from close() or
> > read/write() syscall:
> >
> > # pass PID of 'cp' to this script
> > #!/bin/sh
> > PID=$1
> > /usr/share/bcc/tools/trace -P $PID -t -C \
> > 't:block:block_rq_insert "%s %d %d", args->rwbs, args->sector,
> > args->nr_sector' \
> > 't:syscalls:sys_exit_close ' \
> > 't:syscalls:sys_exit_read ' \
> > 't:syscalls:sys_exit_write '
>
> Meanwhile, I tried to run the test and obtained an error (...usage:
> trace [-h] [-b BUFFER_PAGES] [-p PID]...), so assumed the "-P" should
> be "-p", corrected and obtained the attached log with ext4 and a slow
> copy (2482 seconds) by doing:
>
> - start the test
> - look at the cp pid
> - run the trace
> - wait for the test to finish
> - stop the trace.

The log shows all io submission is from close() syscall, so fs code
is fine, and I have provided the reason of this issue in last email:

https://lore.kernel.org/linux-scsi/e3dc2a3e0221c0a0beb91172ba2bff1f6acc0cb7.camel@xxxxxxxx/T/#m845caca2969da5676516c35dc0c3528a79beb886

Thanks,
Ming