Re: [syzbot] INFO: rcu detected stall in tx

From: Alan Stern
Date: Wed May 19 2021 - 22:01:24 EST


On Wed, May 19, 2021 at 07:38:52PM +0000, Thinh Nguyen wrote:
> Hi Alan,
>
> Sorry if this diverges from the thread, but I've been wondering whether
> to add a change for this also.
>
> For xHCI hosts, after transaction errors, the endpoint will enter the
> halted state.

No. You are misreading the xHCI spec. Section 4.6.8 says:

	... the state of the associated Endpoint Context is set to
	Halted...

Note this carefully. It says "Endpoint Context", not "endpoint".

The endpoint is part of the device, whereas the endpoint context is part
of the host controller. The device doesn't know when a transaction
error has occurred; consequently such errors do not affect the endpoint.
The host controller does know, and consequently such errors do affect
the endpoint context.
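
As a toy illustration of that distinction, the two pieces of state can be
modelled separately. This is only a sketch: every type and function name
below is hypothetical and is not taken from the xHCI driver or the spec.

```c
#include <stdbool.h>

/* Hypothetical model: halt state as the device itself sees it. */
struct device_endpoint {
	bool halted;
};

/* Hypothetical model: state the host controller keeps for the same pipe. */
struct host_ep_context {
	bool halted;
};

/*
 * A transaction error is detected on the host side only, so only the
 * endpoint context changes; the device's endpoint is left untouched.
 */
static void host_sees_transaction_error(struct host_ep_context *ctx,
					struct device_endpoint *ep)
{
	(void)ep;		/* the device is never told about the error */
	ctx->halted = true;
}
```

After the call, the context is halted while the device-side endpoint state
is exactly what it was before.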

> The driver will attempt a few soft-retries before giving
> up. According to the xHCI spec (section 4.6.8), a host may send a
> ClearFeature(endpoint_halt) to recover and restart the transfer (see

Not quite. The section of the spec you're talking about says:

	Software shall execute the following sequence to “reset a
	pipe”.... Issue a ClearFeature(ENDPOINT_HALT) request to
	device.

It does not say the host controller will do this; it says that software
will do it.

> "reset a pipe" in xhci spec), and the class driver can handle this after
> receiving something like -EPROTO from xhci.
>
> However, as you've pointed out, some devices don't like
> ClearFeature(ep_halt) and may not properly synchronize with the host on
> where it should restart.
>
> Some OSes (such as Windows) do this. Not sure whether we want this too?

In general we should do the same thing as Windows does, because most
hardware designers test their equipment on Windows systems but
relatively few test on Linux systems.

> Currently the recovery is just a timeout and a port reset from the class

This depends on the driver. Some perform no recovery at all.

> driver, but the timeout usually defaults to a long time (e.g. 30
> seconds for storage class driver).

That 30-second timeout in the mass-storage driver applies in situations
where a command fails to complete, not in situations where it completes
quickly but with a -EPROTO or -EPIPE error.

The fact is that only a small percentage of -EPROTO errors are
recoverable. Some of them can be handled by a port reset, which can be
pretty awkward to perform but does occasionally work. A lot of them
occur because the USB cable has been unplugged; obviously there's no way
to recover from that. With only a few exceptions, the best and simplest
approach is not to try to recover at all.

For the case in question (the syzbot bug report that started this
thread), the class driver doesn't try to perform any recovery. It just
resubmits the URB, getting into a tight retry loop which consumes too
much CPU time. Simply giving up would be preferable.
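
The difference between the two policies can be sketched in user-space C.
This is a simulation, not kernel code: should_resubmit() and simulate() are
hypothetical names, and treating every status other than 0 as "stop" is an
assumption made for the sketch, not something any particular driver does.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical policy helper: decide from the completion status whether
 * the transfer should be resubmitted.
 */
static bool should_resubmit(int status)
{
	switch (status) {
	case 0:
		return true;	/* success: keep the transfer going */
	case -EPROTO:
		return false;	/* rarely recoverable: give up */
	default:
		return false;	/* assumption: stop on any other error too */
	}
}

/*
 * Simulate a device whose every transfer completes with 'status'.
 * With 'naive' set, the handler resubmits unconditionally, as in the
 * tight retry loop described above.  Returns the number of submissions
 * made, capped at 'limit' so the simulation always terminates.
 */
static int simulate(int status, bool naive, int limit)
{
	int submissions = 0;
	bool resubmit = true;

	while (resubmit && submissions < limit) {
		submissions++;
		resubmit = naive ? true : should_resubmit(status);
	}
	return submissions;
}
```

With a status of -EPROTO, the naive handler keeps going until it reaches the
cap, whereas the give-up policy stops after a single submission.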

Alan Stern