Re: [PATCH] USB:bugfix a controller halt error

From: Alan Stern
Date: Thu Jul 27 2023 - 10:42:39 EST


On Thu, Jul 27, 2023 at 03:03:57PM +0800, liulongfang wrote:
> On 2023/7/26 22:20, Alan Stern wrote:
> >> It may be that the handling solution for ECC errors is different from that
> >> of the OS platform. On the test platform, after usb_control_msg() fails,
> >> reading the memory data of buf will directly lead to a kernel crash:
> >
> > All right, then here's a proposal for a different way to solve the
> > problem: Change the kernel's handler for the ECC error notification.
> > Have it clear the affected parts of memory, so that the kernel can go
> > ahead and use them without crashing.
> >
> > It seems to me that something along these lines must be necessary in
> > any case. Unless the bad memory is cleared somehow, it would never be
> > usable again. The kernel might deallocate it, then reallocate for
> > another purpose, and then crash when the new user tries to access it.
> >
> > In fact, this scenario could still happen even with your patch, which
> > means the patch doesn't really fix the problem.
> >
>
> This patch is only used to prevent data in the buffer from being accessed.
> As long as the data is not accessed, the kernel does not crash.

I still don't understand. You haven't provided nearly enough
information. You should start by answering the questions that Oliver
asked. Then answer this question:

The code you are concerned about is this:

	r = usb_control_msg(udev, usb_rcvaddr0pipe(),
			USB_REQ_GET_DESCRIPTOR, USB_DIR_IN,
			USB_DT_DEVICE << 8, 0,
			buf, GET_DESCRIPTOR_BUFSIZE,
			initial_descriptor_timeout);
	switch (buf->bMaxPacketSize0) {

You're worried that if an ECC memory error occurs during the
usb_control_msg transfer, the kernel will crash when the "switch"
statement tries to read the value of buf->bMaxPacketSize0. That's a
reasonable thing to worry about.
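For illustration only, here is a minimal userspace sketch of the kind of guard the patch appears to describe: skip reading buf when the transfer reports failure. Everything in it is hypothetical. fake_control_msg(), struct fake_device_descriptor and the FAKE_* constants stand in for usb_control_msg(), the real device descriptor and the kernel constants. This is not the code in the patch or in the hub driver. The accepted packet sizes are limited to 8/16/32/64, the USB 2.0 values; the real switch may accept others.

```c
#include <assert.h>

/* Hypothetical stand-in for the device descriptor; only the two fields
 * the switch statement looks at are modeled. */
struct fake_device_descriptor {
	unsigned char bDescriptorType;
	unsigned char bMaxPacketSize0;
};

#define FAKE_USB_DT_DEVICE 0x01	/* assumed value, for illustration */
#define FAKE_EPROTO 71		/* assumed errno value, for illustration */

/* Hypothetical transfer: fills buf and returns the byte count on
 * success, or returns a negative error code without touching buf. */
static int fake_control_msg(struct fake_device_descriptor *buf, int fail)
{
	if (fail)
		return -FAKE_EPROTO;
	buf->bDescriptorType = FAKE_USB_DT_DEVICE;
	buf->bMaxPacketSize0 = 64;
	return (int)sizeof(*buf);
}

/* Returns the validated bMaxPacketSize0 or a negative error code.
 * buf is read only after the transfer reported success. */
static int read_maxpacket0(struct fake_device_descriptor *buf, int fail)
{
	int r;

	assert(buf != 0);
	r = fake_control_msg(buf, fail);
	if (r < 0)
		return r;	/* failure: never look at buf */

	switch (buf->bMaxPacketSize0) {
	case 8: case 16: case 32: case 64:
		if (buf->bDescriptorType == FAKE_USB_DT_DEVICE)
			return buf->bMaxPacketSize0;
		/* fall through */
	default:
		return -FAKE_EPROTO;
	}
}
```

Note that the guard itself depends on r being read correctly, so it offers no protection against an ECC error that corrupts r.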

Now think about what will happen if usb_control_msg succeeds but an
ECC memory error occurs when the function's return code is stored in
r. Won't the kernel crash then? Or if not then, won't it crash when it
reads the value of r a few lines later?

So why bother to handle the first kind of ECC error but not the second?
What makes one ECC error more important than another?

Alan Stern