Re: [PATCH v3] usb: dwc3: gadget: Propagate core init errors to UDC during pullup

From: Krishna Kurapati PSSNV
Date: Tue Jun 20 2023 - 00:54:13 EST




On 6/19/2023 8:36 PM, Johan Hovold wrote:
On Mon, Jun 19, 2023 at 06:20:43PM +0530, Krishna Kurapati PSSNV wrote:
On 6/19/2023 12:36 PM, Johan Hovold wrote:
On Sun, Jun 18, 2023 at 05:39:49PM +0530, Krishna Kurapati wrote:

@@ -2747,7 +2747,9 @@ static int dwc3_gadget_pullup(struct usb_gadget *g, int is_on)
ret = pm_runtime_get_sync(dwc->dev);
if (!ret || ret < 0) {
pm_runtime_put(dwc->dev);
- return 0;
+ if (ret < 0)
+ pm_runtime_set_suspended(dwc->dev);

This bit is broken and is also not mentioned or explained in the commit
message. What are you trying to achieve here?

You cannot set the state like this after runtime PM is enabled and the
above call will always fail.

The reason why I an returning ret is because, when the first get_sync
fails because of core_init failure and we return 0 instead of ret, the
UDC thinks that controller has started successfully but we never set the
run stop bit.

That bit is clear.

So when we plug out the cable, the disconnect event won't
be generated and we never send on systems like android the user space
will never clear the UDC upon disconnect. Its a sort of mismatch between
controller and udc.

Ok, but the controller is an error state after the resume failure. And
here you rely on user space to retry gadget activation in order to
eventually detect the disconnect event?
Also once the first get_sync fails, the dwc->dev->power.runtime_error
flag is set and successive calls to get_sync always return -EINVAL. In
this situation even if UDC/configfs retry pullup, resume_common will
never be called and we never actually start the controller or resume
dwc->dev.

By calling set_suspended, I am trying to clear the runtime_error flag so
that the next retry to pullup will call resume_common and retry
core_init and set run_stop.

Ok, thanks, that's the bit I was missing in the commit message.

First, I perhaps mistakingly thought pm_runtime_set_suspended() may only
be called with PM runtime disabled, but it appears it may indeed be
valid to call also after an error but with the caveat that the device
must then actually be in the suspended state.

The documentation and implementation is inconsistent here as the kernel
doc for pm_runtime_set_suspended() clearly states:

It is not valid to call this function for devices with runtime
PM enabled.

and it also looks like we'd end up with an active-child counter
imbalance if anyone actually tries to do so.

But either way, it also seems like the controller is not guaranteed to
be suspended here as pm_runtime_get_sync() may also fail after a
previous errors that have left the controller in the active state?
> Also, what kind of errors would cause core_init and resume to fail here?

Hi Johan,

As per the comment just above the get_sync during pullup, the resume_common path is used to resume the controller and start peripheral mode incase the dwc3 was in suspended state. So if we are entering gadget_resume we are in suspended state the first time it is called.

Regarding the errors that might leave controller in active state, I have faced issue in core init, and in that function there is a cleanup happening in case something fails. So controller was actually not in active state after cleanup was done. In resume common, if core_init_for_resume is failing, we cleanup everything initialized up until that point.

The scenario you mentioned would be applicable in case gadget_resume fails. We are not having a return value check and I am not sure what would be the side effect of not having that check there. Either ways, since it was failing for core init, I went ahead and made this patch.

As per the reason for failure in core init, the following is what was happening at customer's end:

1. Cable plug-in
2. get_sync calls resume common which inturn calls core_init
3. core soft reset fails in core init, we cleanup and return -110
4. After applying this patch, the -110 was propagated to UDC properly
5. We got a second call to pullup via connect_control and this time reset was successful.

The behavior was similar to [1]. There as well, on all Gen-2 targets, after the retry happens I see soft reset is passing but failing for first attempt.

[1]: https://lore.kernel.org/all/20230510075252.31023-2-quic_kriskura@xxxxxxxxxxx/

Regards,
Krishna,

If this is something that you see during normal operation then this
seems to suggest that something is wrong with the runtime pm
implementation.

Note that virtually all drivers treat resume failures as fatal errors
and do not implement any recovery from that.

In fact, the only other example of this kind of usage that I could find
is also for a Qualcomm driver...

Johan