Re: [PATCH] net: stmmac: protect statistics updates with a spinlock

From: Petr Tesařík
Date: Fri Jan 05 2024 - 09:49:33 EST


On Fri, 5 Jan 2024 15:28:41 +0100
Eric Dumazet <edumazet@xxxxxxxxxx> wrote:

> On Fri, Jan 5, 2024 at 3:26 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> >
> > On Fri, Jan 5, 2024 at 2:27 PM Petr Tesařík <petr@xxxxxxxxxxx> wrote:
> > >
> > > Hi Eric,
> > >
> > > yeah, it's me again...
> > >
> > > On Fri, 5 Jan 2024 12:14:47 +0100
> > > Petr Tesařík <petr@xxxxxxxxxxx> wrote:
> > >
> > > > On Fri, 5 Jan 2024 11:48:19 +0100
> > > > Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> > > >
> > > > > On Fri, Jan 5, 2024 at 11:34 AM Petr Tesařík <petr@xxxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Fri, 5 Jan 2024 10:58:42 +0100
> > > > > > Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > > On Fri, Jan 5, 2024 at 10:16 AM Petr Tesarik <petr@xxxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > Add a spinlock to fix race conditions while updating Tx/Rx statistics.
> > > > > > > >
> > > > > > > > As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
> > > > > > > > u64_stats_sync must ensure mutual exclusion, or one seqcount update could
> > > > > > > > be lost on 32-bit platforms, thus blocking readers forever.
> > > > > > > >
> > > > > > > > Such lockups have actually been observed on 32-bit Arm after stmmac_xmit()
> > > > > > > > on one core raced with stmmac_napi_poll_tx() on another core.
> > > > > > > >
> > > > > > > > Signed-off-by: Petr Tesarik <petr@xxxxxxxxxxx>
> > > > > > >
> > > > > > > This is going to add more costs to 64bit platforms ?
> > > > > >
> > > > > > Yes, it adds a (hopefully not too contended) spinlock and in most
> > > > > > places an interrupt disable/enable pair.
> > > > > >
> > > > > > FWIW the race condition is also present on 64-bit platforms, resulting
> > > > > > in inaccurate statistic counters. I can understand if you consider it a
> > > > > > mild annoyance, not worth fixing.
> > > > > >
> > > > > > > It seems to me that the same syncp can be used from two different
> > > > > > > threads : hard irq and napi poller...
> > > > > >
> > > > > > Yes, that's exactly the scenario that locks up my system.
> > > > > >
> > > > > > > At this point, I do not see why you keep linux/u64_stats_sync.h if you
> > > > > > > decide to go for a spinlock...
> > > > > >
> > > > > > The spinlock does not have to be taken on the reader side, so the
> > > > > > seqcount still adds some value.
> > > > > >
> > > > > > > Alternative would use atomic64_t fields for the ones where there is no
> > > > > > > mutual exclusion.
> > > > > > >
> > > > > > > RX : napi poll is definitely safe (protected by an atomic bit)
> > > > > > > TX : each TX queue is also safe (protected by an atomic exclusion for
> > > > > > > non LLTX drivers)
> > > > > > >
> > > > > > > This leaves the fields updated from hardware interrupt context ?
> > > > > >
> > > > > > I'm afraid I don't have enough network-stack-foo to follow here.
> > > > > >
> > > > > > My issue on 32 bit is that stmmac_xmit() may be called directly from
> > > > > > process context while another core runs the TX napi on the same channel
> > > > > > (in interrupt context). I didn't observe any race on the RX path, but I
> > > > > > believe it's possible with NAPI busy polling.
> > > > > >
> > > > > > In any case, I don't see the connection with LLTX. Maybe you want to
> > > > > > say that the TX queue is safe for stmmac (because it is a non-LLTX
> > > > > > driver), but might not be safe for LLTX drivers?
> > > > >
> > > > > LLTX drivers (mostly virtual drivers like tunnels...) can have multiple cpus
> > > > > running ndo_start_xmit() concurrently. So any use of a 'shared syncp'
> > > > > would be a bug.
> > > > > These drivers usually use per-cpu stats, to avoid races and false
> > > > > sharing anyway.
> > > > >
> > > > > I think you should split the structures into two separate groups, each
> > > > > guarded with its own syncp.
> > > > >
> > > > > No extra spinlocks, no extra costs on 64bit arches...
> > > > >
> > > > > If TX completion can run in parallel with ndo_start_xmit(), then
> > > > > clearly we have to split stmmac_txq_stats into two halves:
> > > >
> > > > Oh, now I get it. Yes, that's much better, indeed.
> > > >
> > > > I mean, the counters have never been consistent (due to the race on the
> > > > writer side), and nobody is concerned. So, there is no value in taking
> > > > a consistent snapshot in stmmac_get_ethtool_stats().
> > > >
> > > > I'm going to rework and retest my patch. Thank you for pointing me in
> > > > the right direction!
> > > >
> > > > Petr T
> > > >
> > > > > Also please note the conversion from u64 to u64_stats_t
> > > >
> > > > Noted. IIUC this will in turn close the update race on 64-bit by using
> > > > an atomic type and on 32-bit by using a seqlock. Clever.
> > > >
> > > > Petr T
> > > >
> > > > > Very partial patch, only to show the split and new structure :
> > > > >
> > > > > diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h
> > > > > b/drivers/net/ethernet/stmicro/stmmac/common.h
> > > > > index e3f650e88f82f927f0dcf95748fbd10c14c30cbe..702bceea5dc8c875a80f5e3a92b7bb058f373eda
> > > > > 100644
> > > > > --- a/drivers/net/ethernet/stmicro/stmmac/common.h
> > > > > +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
> > > > > @@ -60,16 +60,22 @@
> > > > > /* #define FRAME_FILTER_DEBUG */
> > > > >
> > > > > struct stmmac_txq_stats {
> > > > > - u64 tx_bytes;
> > > > > - u64 tx_packets;
> > > > > - u64 tx_pkt_n;
> > > > > - u64 tx_normal_irq_n;
> > > > > - u64 napi_poll;
> > > > > - u64 tx_clean;
> > > > > - u64 tx_set_ic_bit;
> > > > > - u64 tx_tso_frames;
> > > > > - u64 tx_tso_nfrags;
> > > > > - struct u64_stats_sync syncp;
> > > > > +/* First part, updated from ndo_start_xmit(), protected by tx queue lock */
> > > > > + struct u64_stats_sync syncp_tx;
> > > > > + u64_stats_t tx_bytes;
> > > > > + u64_stats_t tx_packets;
> > > > > + u64_stats_t tx_pkt_n;
> > > > > + u64_stats_t tx_tso_frames;
> > > > > + u64_stats_t tx_tso_nfrags;
> > > > > +
> > > > > +/* Second part, updated from TX completion (protected by NAPI poll logic) */
> > > > > + struct u64_stats_sync syncp_tx_completion;
> > > > > + u64_stats_t napi_poll;
> > > > > + u64_stats_t tx_clean;
> > > > > + u64_stats_t tx_set_ic_bit;
> > >
> > > Unfortunately, this field is also updated from ndo_start_xmit():
> > >
> > > 4572) if (set_ic)
> > > 4573) txq_stats->tx_set_ic_bit++;
> > >
> > > I feel it would be a shame to introduce a spinlock just for this one
> > > update. But I think the field could be converted to an atomic64_t.
> > >
> > > Which raises a question: Why aren't all stat counters simply atomic64_t? There
> > > is no guarantee that the reader side takes a consistent snapshot
> > > (except on 32-bit). So, why do we even bother with u64_stats_sync?
> >
> > This infra was added to have no _lock_ overhead on 64bit arches.
> >
> > If a counter must be updated from multiple cpus (regardless of 32/64
> > bit kernel), then use atomic64_t.
>
> Or use two different u64_stats_t (each guarded by a different syncp),
> then fold them at stats gathering time.

Oh, right, why didn't I think about it!

This only leaves an atomic_t in hard irq context. I have tried to find
something that could relax the requirement, but AFAICS at least some
setups use several interrupts that can be delivered to different CPUs
simultaneously, and all of them will walk over all channels. So we're
left with an atomic_t here.

> > > Is it merely because u64_stats_add() should be cheaper than
> > > atomic64_add()? Or is there anything else I'm missing? If yes,
> > > does it invalidate my proposal to convert tx_set_ic_bit to an
> > > atomic64_t?
> >
> > atomic64_add() is expensive, especially in contexts where updates
> > are already guarded by a spinlock or something.

Yeah, I know atomics can be expensive. I've heard of cases where an Arm
core was spinning on an atomic operation for several seconds...

Petr T