Re: [PATCH v2 1/2] tg3: Increment tx_dropped in tg3_tso_bug()

From: Alex Pakhunov
Date: Sun Nov 05 2023 - 14:27:04 EST


> I recommend using per queue counters as briefly mentioned in my
> earlier reply...
> tg3_get_stats64() can just loop and sum all the tx_dropped and
> rx_dropped counters in each tg3_napi struct. We don't worry about
> locks here since we are just reading.

Got it. So the core idea is to make sure there is a single writer for each
counter which will make updating the counter race-free. It does not keep
reading the counters from multiple queues completely race free, but, I
guess, the assumption is that computing the aggregate counter to be
slightly wrong is acceptable - it will be recomputed correctly next time.

There is still some gotchas on 32 bit machines though. 64 bit reads are not
atomic there, so we have to make the counters 32bit to compensate:

====
@@ -11895,6 +11898,9 @@ static void tg3_get_nstats(struct tg3 *tp, struct rtnl_link_stats64 *stats)
{
struct rtnl_link_stats64 *old_stats = &tp->net_stats_prev;
struct tg3_hw_stats *hw_stats = tp->hw_stats;
+ uintptr_t rx_dropped = 0;
+ uintptr_t tx_dropped = 0;
+ int i;

stats->rx_packets = old_stats->rx_packets +
get_stat64(&hw_stats->rx_ucast_packets) +
@@ -11941,8 +11947,27 @@ static void tg3_get_nstats(struct tg3 *tp, struct rtnl_link_stats64 *stats)
stats->rx_missed_errors = old_stats->rx_missed_errors +
get_stat64(&hw_stats->rx_discards);

- stats->rx_dropped = tp->rx_dropped;
- stats->tx_dropped = tp->tx_dropped;
+ /* Aggregate per-queue counters. Each per-queue counter is updated by
+ * a single writer, race-free. The aggregare counters might be not
+ * completely accurate (if an update happens in the middle of the loop)
+ * but they will be recomputed correctly the next time this function is
+ * called. This avoids explicit synchronization between this function
+ * and tg3_rx()/tg3_start_xmit().
+ **/
+ for (i = 0; i < tp->irq_cnt; i++) {
+ struct tg3_napi *tnapi = &tp->napi[i];
+
+ rx_dropped += tnapi->rx_dropped;
+ tx_dropped += tnapi->tx_dropped;
+ }
+
+ /* Since we are using uintptr_t, these counters wrap around at 4G on
+ * a 32bit machine. This seems like an acceptable price for being
+ * able to read them atomically in the loop above.
+ */
+ stats->rx_dropped = rx_dropped;
+ stats->tx_dropped = tx_dropped;
+
}
====

An alternative implementation would use atomic64_add to update
tg3::[rt]x_dropped. It would allow the counters to be 64 bit even on 32 bit
machines. The downside is that updating the counter will be slightly more
expensive. There counters are not updated often, so the cost is negligible.
ALthough it also means that preactically speaking we don't care if
the counters are effectively 32 bits wide.

I'll assume you prefer the former implementation for now, but let me know
if this not the case.

> Yes, we can merge patch #2 first which fixes the stall. Please repost
> just patch #2 standalone if you want to do that.

OK, I posted "[PATCH v3] tg3: Fix the TX ring stall".

Alex.