Re: [PATCH v2 1/2] tg3: Increment tx_dropped in tg3_tso_bug()

From: Alex Pakhunov
Date: Fri Nov 03 2023 - 13:07:32 EST


> This is prone to race conditions if we have more than one TX queue.

Yes, indeed.

> The original driver code only supported one TX queue and the counters
> were never modified properly to support multiple queues. We should
> convert them to per queue counters by moving tx_dropped and rx_dropped
> to the tg3_napi struct.
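
If I understand correctly, that would mean something roughly like the sketch
below (the exact field placement and the summing loop are just my guess at
what you have in mind):

    /* struct tg3_napi gains per-queue drop counters: */
    struct tg3_napi {
            /* ... existing fields ... */
            u64                     tx_dropped;
            u64                     rx_dropped;
    };

    /* and tg3_get_stats64() sums them over tp->irq_cnt queues: */
    int i;

    for (i = 0; i < tp->irq_cnt; i++) {
            struct tg3_napi *tnapi = &tp->napi[i];

            stats->tx_dropped += tnapi->tx_dropped;
            stats->rx_dropped += tnapi->rx_dropped;
    }

Each counter would then only ever be written from its own queue's xmit/NAPI
context, so plain increments stay safe and the reader merely sees slightly
stale values.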

I'm not super familiar with the recommended approach for handling locks in
network drivers, so I spent a bit of time looking at what tg3 does.

It seems that there are a few ways to remove the race condition when
working with these counters:

1. Use atomic increments (see the first sketch after this list). This is
   easy, but it makes every update more expensive than it needs to be. We
   could argue that these specific counters are updated rarely, so maybe we
   don't care too much.
2. netif_tx_lock is already held when tx_dropped is incremented, so wrap the
   rx_dropped increment and the reads of both counters in netif_tx_lock as
   well (see the second sketch below). This seems legal since tg3_tx() can
   take netif_tx_lock. I'm not sure how to order netif_tx_lock and tp->lock,
   though, since tg3_get_stats64() takes the latter. Should netif_tx_lock be
   taken inside tp->lock? Should they not be nested at all?
3. Use tp->lock to protect rx_dropped (tg3_poll_link() already takes it, so
   that must be legal) and netif_tx_lock to protect tx_dropped (noted at the
   end of the second sketch below).
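
For reference, option 1 would be something like the following, assuming the
existing tp->rx_dropped/tp->tx_dropped fields are converted to atomic64_t:

    /* drop paths, e.g. tg3_tso_bug(): */
    atomic64_inc(&tp->tx_dropped);

    /* reporting in tg3_get_stats64(): */
    stats->tx_dropped = atomic64_read(&tp->tx_dropped);
    stats->rx_dropped = atomic64_read(&tp->rx_dropped);

We do not need any ordering here, only that increments are not lost, so the
plain atomic64_*() ops should be enough.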
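
Options 2 and 3 would look roughly like this; I'm only showing the RX-side
increment and the stats read, since tg3_tso_bug() already runs with the TX
queue lock held:

    /* option 2: take netif_tx_lock around the RX-side increment ... */
    netif_tx_lock(tp->dev);
    tp->rx_dropped++;
    netif_tx_unlock(tp->dev);

    /* ... and around the reads in tg3_get_stats64(): */
    netif_tx_lock(dev);
    stats->rx_dropped = tp->rx_dropped;
    stats->tx_dropped = tp->tx_dropped;
    netif_tx_unlock(dev);

Option 3 would instead wrap the rx_dropped increment and read in
spin_lock(&tp->lock), keeping netif_tx_lock only for tx_dropped. Either way,
the lock ordering question above still needs an answer.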

There are probably other options. Can you recommend an approach?

Also, this seems like a larger change that should be done separately from
fixing the TX stall. Should we land just "[PATCH v2 2/2]"? Or should we land
the whole series (since it does not make the race condition much worse) and
fix the race condition separately?

Alex.