Re: [PATCH] usb: dwc3: fix race of usb_gadget_driver operation

From: linyyuan
Date: Mon Jun 28 2021 - 05:36:51 EST


On 2021-06-27 22:09, Alan Stern wrote:
On Sun, Jun 27, 2021 at 10:48:56AM +0800, linyyuan@xxxxxxxxxxxxxx wrote:
On 2021-06-26 23:03, Alan Stern wrote:
> On Sat, Jun 26, 2021 at 09:16:25AM +0800, linyyuan@xxxxxxxxxxxxxx wrote:
> > On 2021-06-26 00:37, Alan Stern wrote:

> > > Here and in the other places, you should test dwc->async_callbacks
> > > _before_ dropping the spinlock. Otherwise there is a race (the flag
> > > could be written at about the same time it is checked).
> > thanks for your comments,
> >
> > if you think there is race here, how to make sure gadget_driver
> > pointer is
> > safe,
> > this is closest place where we can confirm it is non-NULL by checking
> > async_callbacks ?
>
> I explained this twice already: We know that gadget_driver is not
> NULL because usb_gadget_remove_driver calls synchronize_irq before
> doing usb_gadget_udc_stop.
>
> Look at this timing diagram:
>
> CPU0 CPU1
> ---- ----
> IRQ happens for setup packet
> Handler sees async_callbacks
> is enabled
> Handler unlocks dwc->lock
> usb_gadget_remove_driver runs
> Disables async callbacks
> Calls synchronize_irq
> Handler calls dwc-> . waits for IRQ handler to
> gadget_driver->setup . return
> Handler locks dwc-lock .
> ... .
> Handler returns .
> . synchronize_irq returns
> Calls usb_gadget_udc_stop
> dwc->gadget_driver is
> set to NULL
>
> As you can see, dwc->gadget_driver is non-NULL when CPU0 uses it,
> even though async_callbacks gets cleared during the time when the
> lock is released.
thanks for your patient explanation,
but from this part, seem it is synchronize_irq() help to avoid NULL pointer
crash.

That's right.

can you also explain how async_callbacks flag help here ?

It doesn't help in the situation shown above, but it does help in other
situations. Consider this timing diagram:

CPU0 CPU1
---- ----
usb_gadget_remove_driver runs
Disables async callbacks
Calls synchronize_irq
synchronize_irq returns
Calls udc_driver_unbind
IRQ happens for disconnect
Handler sees async_callbacks
is disabled
Handler returns
Calls usb_gadget_udc_stop
dwc->gadget_driver is
set to NULL

With the async_callbacks check, everything works okay. But now look at
what would happen without the async_callbacks mechanism:

CPU0 CPU1
---- ----
usb_gadget_remove_driver runs
Calls synchronize_irq
synchronize_irq returns
Calls udc_driver_unbind
IRQ happens for disconnect
Handler unlocks dwc->lock
Calls dwc->gadget_driver->disconnect
Gadget driver has already been unbound
and is not prepared to handle a
callback, so it crashes
Calls usb_gadget_udc_stop
dwc->gadget_driver is
set to NULL

Without the async_callbacks mechanism, the gadget driver can get a
callback at the wrong time (after it has been unbound), which might
cause it to crash.
1. do you think we need to back to my original patch,
https://lore.kernel.org/linux-usb/20210619154309.52127-1-linyyuan@xxxxxxxxxxxxxx/T/#t

i think we can add the spin lock or mutex lock to protect this kind of race from UDC layer, it will be easy understanding.


2. if you insist this kind of change, how to change following code in dwc3 ?
if (dwc->gadget_driver && dwc->gadget_driver->disconnect) {

2.1 if (dwc->async_callbacks && dwc->gadget_driver->disconnect) {
or
2.2 if (dwc->async_callbacks && vdwc->gadget_driver && dwc->gadget_driver->disconnect) {



Alan Stern