Re: [PATCH] tg3: driver sleeps indefinitely when EEH errors exceed eeh_max_freezes

From: David Christensen
Date: Mon Jun 15 2020 - 18:21:20 EST


On 6/15/20 1:45 PM, Michael Chan wrote:
On Mon, Jun 15, 2020 at 12:01 PM David Christensen
<drc@xxxxxxxxxxxxxxxxxx> wrote:

The driver function tg3_io_error_detected() calls napi_disable twice,
without an intervening napi_enable, when the number of EEH errors exceeds
eeh_max_freezes, resulting in an indefinite sleep while holding rtnl_lock.

The function is called once with the PCI state pci_channel_io_frozen and
then called again with the state pci_channel_io_perm_failure when the
number of EEH failures in an hour exceeds eeh_max_freezes.

Protecting the calls to napi_enable/napi_disable with a new state
variable prevents the long sleep.

This works, but I think a simpler fix is to check tp->pcierr_recovery
in tg3_io_error_detected() and skip most of the tg3 calls (including
the one that disables NAPI) if the flag is true.

This might be the smallest change that would work. Does it make sense to the reader?

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 7a3b22b35238..1f37c69d213d 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -18168,8 +18168,8 @@ static pci_ers_result_t tg3_io_error_detected(struct pci_dev *pdev,

rtnl_lock();

- /* We probably don't have netdev yet */
- if (!netdev || !netif_running(netdev))
+ /* May be second call or maybe we don't have netdev yet */
+ if (tp->pcierr_recovery || !netdev || !netif_running(netdev))
goto done;

/* We needn't recover from permanent error */