Re: [PATCH v2 1/2] usb: dwc3: Not set DWC3_EP_END_TRANSFER_PENDING in ep cmd fails

From: Thinh Nguyen
Date: Thu Feb 17 2022 - 18:49:52 EST


Wesley Cheng wrote:
> Hi Thinh,
>
> On 2/15/2022 9:14 AM, Thinh Nguyen wrote:
>> Jung Daehwan wrote:
>>> Hi Thinh,
>>>
>>> On Mon, Feb 14, 2022 at 06:44:33PM +0000, Thinh Nguyen wrote:
>>>> Hi,
>>>>
>>>> Daehwan Jung wrote:
>>>>> It always sets DWC3_EP_END_TRANSFER_PENDING in dwc3_stop_active_transfer
>>>>> even if dwc3_send_gadget_ep_cmd fails. It can cause some problems like
>>>>
>>>> How does it fail? Timed out?
>>>
>>> Yes, timed out.
>>>>
>>>>> skipping clear stall commmand or giveback from dequeue. We fix to set it
>>>>> only when ep cmd success. Additionally, We clear DWC3_EP_TRANSFER_STARTED
>>>>> for next trb to start transfer not update transfer.
>>>>
>>>> We shouldn't do this. Things will be out of sync. It may work for 1
>>>> scenario, but it won't work for others.
>>>>
>>>> Please help me understand a few things:
>>>>
>>>> 1) What is the scenario that triggers this? Is it random?
>>>>
>>> ep cmd timeout occurs on dequeue request from user side. End Transfer command
>>> would be sent in dwc3_stop_active transfer.
>>
>> At the high level, what's triggering the request dequeue? Is it from a
>> disconnect, change of interface, or simply function driver protocol that
>> changes it.
>>
>> What application was used to trigger this?
>>
> Sorry for jumping in here, but looks like Daehwan is running into a
> similar issue I am seeing as well.
>
> At least in my scenario, the dequeue is coming from a function driver
> which exposes a device to userspace. Once that device is closed, it
> will issue a dequeue on all pending/submitted requests.

Dequeuing request is coming from the function driver, but what causes
the dequeue. For example, the End Transfer command due to a disconnect
may give a different clues than a dequeue from a change of interface.

>
>>>
>>>> 2) Are there other traffics pending while issuing the End Transfer
>>>> command? If so, what transfer type(s)?
>>>>
>>> I haven't checked it yet.
>>
>> Can you check?
>>
> For the cases where we've collected a crash log, we can see that during
> the END transfer timeouts there was always a pending EP0 transaction.
> We had reached out to our internal HW folks to get some inputs on what
> could be causing this kind of issue, and we were able to get some
> recommendations from their Synopsis POCs.

It's "Synopsys" :)

>
> It was mentioned that if there was an active EP0 transfer, an end
> transfer command on a non-control EP can fail w/ timed out.
>

What controller version are you using? And what version is Jung using?
Do you have the STAR number of the issue. If you're using a different
version than Jung's, then it may not be the same issue.

>>>
>>>> 3) Have you tried increasing the timeout?
>>>>
>>> No, I haven't.
>>
>> Can you try up to 10 seconds (just for experiment)
>>
> I've tried this too, and it did not help.
>
>>>> BR,
>>>> Thinh
>>>>
>>>
>>> This issue occurs very rarely on customer. I only have restricted
>>> information. That's why I've been trying to reproduce it.
>>
>> How did you test your fix if you can't reproduce it?
>>
>>>
>>> Wesley may have run into same issue and you can see this issue in detail.
>>> https://urldefense.com/v3/__https://protect2.fireeye.com/v1/url?k=9d423b69-fc3fd32e-9d43b026-74fe485fff30-77a099b52659410d&q=1&e=20b4d9f5-2599-4f57-8b6a-7c4ec167d228&u=https*3A*2F*2Flore.kernel.org*2Flinux-usb*2F20220203080017.27339-1-quic_wcheng*40quicinc.com*2F__;JSUlJSUlJQ!!A4F2R9G_pg!JWPzNLoO3BFX_IZCVzmHPtxq6frr_VFbSNNaxSQylunt1Y4EauTOefth2LCIcVEuTx8E$
>>>
>>
>> I can take a look, but please provide the tracepoints of the failure if
>> you can reproduce it.
>>
> Let me see if I have any previous traces I can share. If not, I have a
> pretty reliable repro set up I can collect a trace for you. For now, I
> will focus on just getting the endxfer timeout seen during ep dequeue.
> As mentioned on my patchset, this can happen during device-initiated
> disconnect as well.
>

Your patch set is still on my todo list. I haven't reviewed it. There's
some concern looking at it from a first glance, I'll check it out more
thoroughly later.

Can you provide the tracepoints?

Thanks,
Thinh