Re: [PATCH] USB:bugfix a controller halt error

From: Alan Stern
Date: Thu Jul 27 2023 - 11:57:56 EST


On Thu, Jul 27, 2023 at 05:31:41PM +0200, Oliver Neukum wrote:
> On 27.07.23 16:42, Alan Stern wrote:
> > On Thu, Jul 27, 2023 at 03:03:57PM +0800, liulongfang wrote:
> > > On 2023/7/26 22:20, Alan Stern wrote:
>
> > > > It seems to me that something along these lines must be necessary in
> > > > any case. Unless the bad memory is cleared somehow, it would never be
> > > > usable again. The kernel might deallocate it, then reallocate for
> > > > another purpose, and then crash when the new user tries to access it.
> > > >
> > > > In fact, this scenario could still happen even with your patch, which
> > > > means the patch doesn't really fix the problem.
>
> I suppose in theory you could have something like a bad blocks list
> just for RAM, but that would really hurt. You'd have to do something
> about every DMA operation in every driver in theory.
>
> Error handling would basically be an intentional memory leak.

I started out thinking this way, but maybe that's not how it works.
Perhaps simply overwriting the part of memory that got the ECC error
would clear the error state. (This may depend on the kind of error,
one-time vs. permanent.)

If that's the case, and if the memory buffer was deallocated without
being accessed and then later reallocated, things would be okay. The
routine that reallocated the buffer wouldn't try to read from it before
initializing it somehow.

> > > This patch is only used to prevent data in the buffer from being accessed.
> > > As long as the data is not accessed, the kernel does not crash.
> >
> > I still don't understand. You haven't provided nearly enough
> > information. You should start by answering the questions that Oliver
> > asked. Then answer this question:
> >
> > The code you are concerned about is this:
> >
> > r = usb_control_msg(udev, usb_rcvaddr0pipe(),
> > USB_REQ_GET_DESCRIPTOR, USB_DIR_IN,
> > USB_DT_DEVICE << 8, 0,
> > buf, GET_DESCRIPTOR_BUFSIZE,
> > initial_descriptor_timeout);
> > switch (buf->bMaxPacketSize0) {
> >
> > You're worried that if an ECC memory error occurs during the
> > usb_control_msg transfer, the kernel will crash when the "switch"
> > statement tries to read the value of buf->bMaxPacketSize0. That's a
> > reasonable thing to worry about.
>
> Albeit unlikely. If the hardware and implementation are reasonable
> you'd return a specific error code from the HCD and clean up the
> RAM in your ecc driver.
>
> The fix for USB would then conceptually be something like
>
> retryio:
> r = usb_control_msg()
> if (r == -EMEMORYCORRUPTION)
> goto retryio;

Yes, we could do this, but it's not necessary. Let's say that the HCD
returns -EMEMORYCORRUPTION and the ecc driver cleans up the RAM
(probably by resetting its contents to 0, but possibly leaving garbage
there instead). Then when the following code in hub_port_init() tests
buf->bMaxPacketSize0, it will see an invalid value and will retry the
transfer.

Or, with low probability, it will see a valid but incorrect value. If
that happens then later transfers using ep0 will fail, causing the hub
driver to reiterate the outer loop in hub_port_connect(). Eventually
the device will be correctly initialized and enumerated.

Alan Stern