Re: [PATCH] USB:bugfix a controller halt error

From: liulongfang
Date: Thu Jul 27 2023 - 03:09:58 EST


On 2023/7/26 22:20, Alan Stern wrote:
> On Wed, Jul 26, 2023 at 02:58:18PM +0800, liulongfang wrote:
>> On 2023/7/21 22:57, Alan Stern Wrote:
>>> On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote:
>>>> On systems that use ECC memory. The ECC error of the memory will
>>>> cause the USB controller to halt. It causes the usb_control_msg()
>>>> operation to fail.
>>>
>>> How often does this happen in real life? (Besides, it seems to me that
>>> if your system is getting a bunch of ECC memory errors then you've got
>>> much worse problems than a simple USB failure!)
>>>
>>
>> This problem is on ECC memory platform.
>> In the test scenario, the problem is 100% reproducible.
>>
>>> And why do you worry about ECC memory failures in particular? Can't
>>> _any_ kind of failure cause the usb_control_msg() operation to fail?
>>>
>>>> At this point, the returned buffer data is an abnormal value, and
>>>> continuing to use it will lead to incorrect results.
>>>
>>> The driver already contains code to check for abnormal values. The
>>> check is not perfect, but it should prevent things from going too
>>> badly wrong.
>>>
>>
>> If it is ECC memory error. These parameter checks would also
>> actually be invalid.
>>
>>>> Therefore, it is necessary to judge the return value and exit.
>>>>
>>>> Signed-off-by: liulongfang <liulongfang@xxxxxxxxxx>
>>>
>>> There is a flaw in your reasoning.
>>>
>>> The operation carried out here is deliberately unsafe (for full-speed
>>> devices). It is made before we know the actual maxpacket size for ep0,
>>> and as a result it might return an error code even when it works okay.
>>> This shouldn't happen, but a lot of USB hardware is unreliable.
>>>
>>> Therefore we must not ignore the result merely because r < 0. If we do
>>> that, the kernel might stop working with some devices.
>>>
>> It may be that the handling solution for ECC errors is different from that
>> of the OS platform. On the test platform, after usb_control_msg() fails,
>> reading the memory data of buf will directly lead to kernel crash:
>
> All right, then here's a proposal for a different way to solve the
> problem: Change the kernel's handler for the ECC error notification.
> Have it clear the affected parts of memory, so that the kernel can go
> ahead and use them without crashing.
>
> It seems to me that something along these lines must be necessary in
> any case. Unless the bad memory is cleared somehow, it would never be
> usable again. The kernel might deallocate it, then reallocate for
> another purpose, and then crash when the new user tries to access it.
>
> In fact, this scenario could still happen even with your patch, which
> means the patch doesn't really fix the problem.
>

This patch is only used to prevent data in the buffer from being accessed.
As long as the data is not accessed, the kernel does not crash.

thanks,
Longfang.

> Alan Stern
>
> .
>