Re: [PATCH] USB:bugfix a controller halt error

From: Alan Stern
Date: Wed Jul 26 2023 - 10:20:31 EST


On Wed, Jul 26, 2023 at 02:58:18PM +0800, liulongfang wrote:
> On 2023/7/21 22:57, Alan Stern Wrote:
> > On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote:
> >> On systems that use ECC memory. The ECC error of the memory will
> >> cause the USB controller to halt. It causes the usb_control_msg()
> >> operation to fail.
> >
> > How often does this happen in real life? (Besides, it seems to me that
> > if your system is getting a bunch of ECC memory errors then you've got
> > much worse problems than a simple USB failure!)
> >
>
> This problem is on ECC memory platform.
> In the test scenario, the problem is 100% reproducible.
>
> > And why do you worry about ECC memory failures in particular? Can't
> > _any_ kind of failure cause the usb_control_msg() operation to fail?
> >
> >> At this point, the returned buffer data is an abnormal value, and
> >> continuing to use it will lead to incorrect results.
> >
> > The driver already contains code to check for abnormal values. The
> > check is not perfect, but it should prevent things from going too
> > badly wrong.
> >
>
> If it is ECC memory error. These parameter checks would also
> actually be invalid.
>
> >> Therefore, it is necessary to judge the return value and exit.
> >>
> >> Signed-off-by: liulongfang <liulongfang@xxxxxxxxxx>
> >
> > There is a flaw in your reasoning.
> >
> > The operation carried out here is deliberately unsafe (for full-speed
> > devices). It is made before we know the actual maxpacket size for ep0,
> > and as a result it might return an error code even when it works okay.
> > This shouldn't happen, but a lot of USB hardware is unreliable.
> >
> > Therefore we must not ignore the result merely because r < 0. If we do
> > that, the kernel might stop working with some devices.
> >
> It may be that the handling solution for ECC errors is different from that
> of the OS platform. On the test platform, after usb_control_msg() fails,
> reading the memory data of buf will directly lead to kernel crash:

All right, then here's a proposal for a different way to solve the
problem: Change the kernel's handler for the ECC error notification.
Have it clear the affected parts of memory, so that the kernel can go
ahead and use them without crashing.

It seems to me that something along these lines must be necessary in
any case. Unless the bad memory is cleared somehow, it would never be
usable again. The kernel might deallocate it, then reallocate for
another purpose, and then crash when the new user tries to access it.

In fact, this scenario could still happen even with your patch, which
means the patch doesn't really fix the problem.

Alan Stern