Re: [PATCH] tg3: add new module param to force device power down on reboot

From: Pavan Chebbi
Date: Tue Jan 09 2024 - 23:13:12 EST


On Wed, Jan 10, 2024 at 2:01 AM Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote:
>
> On 09.01.2024 20:45, Andrea Fois wrote:
> > The bug #1917471 was fixed in commit 2ca1c94ce0b6 ("tg3: Disable tg3
> > device on system reboot to avoid triggering AER") but was reintroduced
> > by commit 9fc3bc764334 ("tg3: power down device only on
> > SYSTEM_POWER_OFF").
> >
> > The problem described in #1917471 is still consistently replicable on
> > reboots on Dell Servers (i.e. R750xs with BCM5720 LOM), causing NMIs
> > (i.e. NMI received for unknown reason 38 on cpu 0) after 9fc3bc764334
> > was committed.
> >
> > The problem is detected also by the Lifecycle controller and logged as
> > a PCI Bus Error for the device.
> >
> > As the problems addressed by 2ca1c94ce0b6 and by 9fc3bc764334 requires
> > opposite strategies, a new module param "force_pwr_down_on_reboot"
> > <bool> is introduced to fix both scenarios:
> >
> Adding module parameters is discouraged. What I see could try:
Ack.
>
> - limit 9fc3bc764334 to the specific machine type mentioned in the
> commit message (based DMI info)
> - 2ca1c94ce0b6 performs two actions: power down tg3 and disable device
> Based on the commit description disabling the device might be sufficient.

I think the second suggestion could be a better solution. Helps to
solve the issue 9fc3bc764334 is trying to fix.
But I am not sure how easy it is to test. As I recall, Goerge was
unable to reach out to the author of 2ca1c94ce0b6 when he wanted to
test his patch for regression.
We did discuss the risk of this regression.
https://patchwork.kernel.org/project/netdevbpf/patch/20231101130418.44164-1-george.shuklin@xxxxxxxxx/
Unfortunately, looks like it has come true :(

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature