Re: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

From: Andrea Vai
Date: Wed Nov 27 2019 - 10:01:37 EST


Il giorno mer, 27/11/2019 alle 21.08 +0800, Ming Lei ha scritto:
> On Wed, Nov 27, 2019 at 10:39:40AM +0100, Andrea Vai wrote:
> > Il giorno mer, 27/11/2019 alle 10.05 +0800, Ming Lei ha scritto:
> > >
> > >
> > > It can be workaround via the following change:
> > >
> > > /lib/modules/5.4.0+/build/include/generated/autoconf.h:
> > >
> > > //#define CONFIG_CC_HAS_ASM_INLINE 1
> >
> > Thanks, it worked, trace attached. Produced by: start the trace
> script
> > (with the pendrive already plugged), wait some seconds, run the
> test
> > (1 trial, 1 GB), wait for the test to finish, stop the trace.
> >
> > The copy took 2659 seconds, roughly as already seen before.
>
> Thanks for collecting the log.
>
> From the log, some of write IOs are out-of-order, such as, the 1st
> one
> is 378880.
>
> 16.41240 2 266 266 kworker/2:1H block_rq_issue b'W'
> 370656 240
> 16.41961 3 485 485 kworker/3:1H block_rq_issue b'W'
> 378880 240
> 16.73729 2 266 266 kworker/2:1H block_rq_issue b'W'
> 370896 240
> 17.71161 2 266 266 kworker/2:1H block_rq_issue b'W'
> 379120 240
> 18.02344 2 266 266 kworker/2:1H block_rq_issue b'W'
> 371136 240
> 18.94314 3 485 485 kworker/3:1H block_rq_issue b'W'
> 379360 240
> 19.25624 2 266 266 kworker/2:1H block_rq_issue b'W'
> 371376 240
>
> IO latency is increased a lot since the 1st out-of-order request(usb
> storage HBA is single queue depth, one request can be issued only
> if
> the previous issued request is completed).
>
> The reason is that there are two kind of tasks which inserts rq to
> device.
> One is the 'cp' process, the other is kworker/u8:*. The out-of-
> order
> happens during the two task's interleaving.
>
> Under such situation, I believe that the old legacy IO path may not
> guarantee the order too. In blk_queue_bio(), after get_request()
> allocates one request, the queue lock is released. And request is
> actually inserted & issued from blk_flush_plug_list() under the
> branch of 'if (plug)'. If requests are from two tasks, then request
> is inserted/issued from two plug list, and no order can be
> guaranteed.
>
> In my test, except for several requests from the beginning, all
> other
> requests are inserted via the kworker thread(guess it is writeback
> wq),
> that is why I can't observe the issue in my test.
>
> As Schmid suggested, you may run the same test on old kernel with
> legacy io path, and see if the performance is still good.
>
> Also, could you share the following info about your machine? So that
> I can build my VM guest in this setting for reproducing your
> situation
> (requests are inserted from two types of threads).
>
> - lscpu
attached,

> - free -h
total used free shared buff/cache available
Mem: 23Gi 4,2Gi 11Gi 448Mi 7,0Gi 18Gi
Swap: 3,7Gi 0B 3,7Gi

> - lsblk -d $USB_DISK

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdg 8:96 1 28,8G 0 disk


> - exact commands for mount the disk, and running the copy operation

I attached the whole script to this thread, I attach it again to this
message and copy the relevant lines here:

mount UUID=$uuid /mnt/pendrive 2>&1 |tee -a $logfile
SECONDS=0
cp $testfile /mnt/pendrive 2>&1 |tee -a $logfile
umount /mnt/pendrive 2>&1 |tee -a $logfile

Meanwhile, I am going on with the further tests as suggested

Thanks,
Andrea
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i5-4430 CPU @ 3.00GHz
Stepping: 3
CPU MHz: 1674.727
CPU max MHz: 3200,0000
CPU min MHz: 800,0000
BogoMIPS: 5986.16
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d

Attachment: test
Description: application/shellscript