Re: [PATCH] tg3: add new module param to force device power down on reboot

From: Heiner Kallweit
Date: Wed Jan 10 2024 - 02:09:29 EST


On 10.01.2024 05:12, Pavan Chebbi wrote:
> On Wed, Jan 10, 2024 at 2:01 AM Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote:
>>
>> On 09.01.2024 20:45, Andrea Fois wrote:
>>> The bug #1917471 was fixed in commit 2ca1c94ce0b6 ("tg3: Disable tg3
>>> device on system reboot to avoid triggering AER") but was reintroduced
>>> by commit 9fc3bc764334 ("tg3: power down device only on
>>> SYSTEM_POWER_OFF").
>>>
>>> The problem described in #1917471 is still consistently replicable on
>>> reboots on Dell Servers (i.e. R750xs with BCM5720 LOM), causing NMIs
>>> (i.e. NMI received for unknown reason 38 on cpu 0) after 9fc3bc764334
>>> was committed.
>>>
>>> The problem is detected also by the Lifecycle controller and logged as
>>> a PCI Bus Error for the device.
>>>
>>> As the problems addressed by 2ca1c94ce0b6 and by 9fc3bc764334 requires
>>> opposite strategies, a new module param "force_pwr_down_on_reboot"
>>> <bool> is introduced to fix both scenarios:
>>>
>> Adding module parameters is discouraged. What I see could try:
> Ack.
>>
>> - limit 9fc3bc764334 to the specific machine type mentioned in the
>> commit message (based DMI info)
>> - 2ca1c94ce0b6 performs two actions: power down tg3 and disable device
>> Based on the commit description disabling the device might be sufficient.
>
> I think the second suggestion could be a better solution. Helps to
> solve the issue 9fc3bc764334 is trying to fix.
> But I am not sure how easy it is to test. As I recall, Goerge was
> unable to reach out to the author of 2ca1c94ce0b6 when he wanted to
> test his patch for regression.
> We did discuss the risk of this regression.
> https://patchwork.kernel.org/project/netdevbpf/patch/20231101130418.44164-1-george.shuklin@xxxxxxxxx/
> Unfortunately, looks like it has come true :(

If the culprit is an unexpected MSI interrupt, then you may also address
this directly by disabling interrupts in the device. tg3_stop() may be
a candidate here.