Re: [PATCH] usb: hub: move resume delay at the head of all USB access functions

From: Alan Stern
Date: Thu Jan 02 2020 - 16:37:16 EST


On Wed, 25 Dec 2019, AceLan Kao wrote:

> Here[1] are the dmesg and the usbmon log from wireshark, and
> /sys/kernel/debug/usb/usbmon/0u.
>
> I verified this issue on Dell XPS 13 + Dell Salomon WD19 docking
> station(plug-in 3 USB disk on it)
> After s2idle 7 times, 2 usb disks lost. But from wireshark log, the
> packets look normal, no error.
>
> So, I re-do the test again and log the usbmon/0u output, but it's greek to me.
> Hope you can help to find some clues in the logs.
> Thanks.
>
> 1. https://people.canonical.com/~acelan/bugs/usb_issue/

Maybe Mathias can help figure out what's going on. It clearly is an
xHCI-related problem of some sort.

I can't get much out of these logs. For one thing, the time period
covered by the usbmon trace is different from the time period in your
dmesg log. When you collect two kinds of logs for a test, it's
important that they should be collected at the _same_ time! Otherwise
they record different events, which is no use.

For another, your usbmon trace used the 0u file, but you should have
used the 2u file instead. All the errors you get involve devices on
bus 2; including other buses in the trace just makes it more confusing
and causes more entries to be dropped.

Also, I don't understand why your Dell Dock devices show up the way
they do. The SuperSpeed dock is on bus 2, and the high speed dock is
on bus 3, which is attached to a different xHCI controller! A hub
(which is what a dock is) should show up twice, and both instances
should be attached to the same controller.

In any case, it's clear that you've got some problem involving the xHCI
controller and the SuperSpeed Dell Dock (device 2-1). Here's an
excerpt from the dmesg log showing what goes wrong. This occurs during
a system resume; apparently the dock's connection drops while the
system is suspended and there are errors when the system tries to
re-activate it:

[ 721.507202] usb 2-1: USB disconnect, device number 32
[ 721.507206] usb 2-1.3: USB disconnect, device number 33
[ 721.507209] usb 2-1.3.1: USB disconnect, device number 35
[ 721.540214] usb 2-1.3.3: USB disconnect, device number 36
[ 721.584518] usb 2-1.4: USB disconnect, device number 34
[ 721.647466] xhci_hcd 0000:00:0d.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.

That's the first problem.

[ 721.919433] usb 2-1: new SuperSpeedPlus Gen 2 USB device number 38 using xhci_hcd
[ 721.942597] usb 2-1: New USB device found, idVendor=0bda, idProduct=0487, bcdDevice= 1.47
[ 721.942600] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 721.942602] usb 2-1: Product: Dell dock
[ 721.942604] usb 2-1: Manufacturer: Dell Inc.
[ 721.944849] hub 2-1:1.0: USB hub found
[ 721.945232] hub 2-1:1.0: 4 ports detected
[ 722.178395] hub 2-1:1.0: hub_ext_port_status failed (err = -71)
[ 722.178570] usb 2-1-port3: attempt power cycle
[ 722.559212] usb 2-1: USB disconnect, device number 38
[ 722.559311] usb 2-1: Failed to suspend device, error -19

And that's a second error. Then the system tries again:

[ 722.831431] usb 2-1: new SuperSpeedPlus Gen 2 USB device number 43 using xhci_hcd
[ 722.854608] usb 2-1: New USB device found, idVendor=0bda, idProduct=0487, bcdDevice= 1.47
[ 722.854611] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 722.854614] usb 2-1: Product: Dell dock
[ 722.854616] usb 2-1: Manufacturer: Dell Inc.
[ 722.857034] hub 2-1:1.0: USB hub found
[ 722.857440] hub 2-1:1.0: 4 ports detected
[ 723.090106] hub 2-1:1.0: hub_ext_port_status failed (err = -71)
[ 723.090316] usb 2-1-port3: attempt power cycle
[ 723.471212] usb 2-1: USB disconnect, device number 43
[ 723.471274] usb 2-1: Failed to suspend device, error -19

And it fails in the same way. A third try ends up succeeding:

[ 723.743465] usb 2-1: new SuperSpeedPlus Gen 2 USB device number 48 using xhci_hcd
...

I don't see why the device numbers jump around in such a crazy manner.
When device 38 disconnects, the next device is number 43. And when
that fails, the next number is 48! It looks like device numbers are
somehow being allocated for the four child devices even though the log
doesn't show them being detected.

Here's a portion of the usbmon trace that apparently corresponds
somewhat to part of the log above, although it's hard to be certain.
In this trace, device 77 on bus 2 is 2-1. Unlike the log above, it
only has a child attached to port 3, not to port 4:

ffff9879b43ff840 291366827 S Ci:2:001:0 s a3 00 0000 0001 0004 4 <
ffff9879b43ff840 291366846 C Ci:2:001:0 0 4 = 03020000
ffff9879b2f54780 291469875 S Ci:2:077:0 s a3 00 0000 0001 0004 4 <
ffff9879b2f54780 291470062 C Ci:2:077:0 0 4 = a0020000
ffff9879b2f54780 291470069 S Ci:2:077:0 s a3 00 0000 0002 0004 4 <
ffff9879b2f54780 291470208 C Ci:2:077:0 0 4 = a0020000
ffff9879b2f54780 291470214 S Ci:2:077:0 s a3 00 0000 0003 0004 4 <
ffff9879b2f54780 291470387 C Ci:2:077:0 0 4 = 03021000
ffff9879b2f54780 291470392 S Co:2:077:0 s 23 01 0014 0003 0000 0
ffff9879b2f54780 291470527 C Co:2:077:0 0 0
ffff9879b2f54780 291470533 S Ci:2:077:0 s a3 00 0000 0004 0004 4 <
ffff9879b2f54780 291470709 C Ci:2:077:0 0 4 = a0020000
ffff9879b43ff480 291573804 S Ii:2:077:1 -115:128 2 <
ffff9879b2f54780 291573839 S Ci:2:077:0 s a3 00 0000 0003 0004 4 <
ffff9879b2f54780 291574068 C Ci:2:077:0 0 4 = 03020000
ffff9879b2f54780 291574166 S Ci:2:077:0 s a3 00 0000 0003 0004 4 <
ffff9879b2f54780 291574363 C Ci:2:077:0 0 4 = 03020000
ffff9879b2f54780 291574383 S Co:2:077:0 s 23 03 0004 0003 0000 0
ffff9879b2f54780 291574580 C Co:2:077:0 0 0
ffff9879b2f54780 291593804 S Ci:2:077:0 s a3 00 0002 0003 0008 8 <
ffff9879ae2969c0 291595804 C Ii:2:001:1 0:2048 1 = 02
ffff9879ae2969c0 291595824 S Ii:2:001:1 -115:2048 4 <
ffff9879ba3476c0 291595843 S Ci:2:001:0 s a3 00 0000 0001 0004 4 <
ffff9879ba3476c0 291595857 C Ci:2:001:0 0 4 = c0024100

As far as I can tell, the device was working fine until at this point
it disconnected itself.

ffff9879ba3476c0 291595863 S Co:2:001:0 s 23 01 0010 0001 0000 0
ffff9879ba3476c0 291595867 C Co:2:001:0 0 0
ffff9879ba3476c0 291595870 S Co:2:001:0 s 23 01 0019 0001 0000 0
ffff9879ba3476c0 291595875 C Co:2:001:0 0 0
ffff9879ba3476c0 291595879 S Co:2:001:0 s 23 03 001c 0001 0000 0
ffff9879ba3476c0 291595884 C Co:2:001:0 0 0
ffff9879b2f54780 291600949 C Ci:2:077:0 -71 0
ffff9879b2f54780 291600997 S Co:2:077:0 s 23 03 0004 0003 0000 0
ffff9879b2f54780 291601000 E Co:2:077:0 -19 0

Mathias, any suggestions on how Acelan can track down the reasons for
these problems?

Alan Stern