RE: [PATCH net-next] stmmac: align RX buffers

From: Joakim Zhang
Date: Wed Aug 11 2021 - 06:56:46 EST



> -----Original Message-----
> From: Thierry Reding <thierry.reding@xxxxxxxxx>
> Sent: August 11, 2021 18:42
> To: Marc Zyngier <maz@xxxxxxxxxx>
> Cc: Matteo Croce <mcroce@xxxxxxxxxxxxxxxxxxx>; netdev@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; linux-riscv@xxxxxxxxxxxxxxxxxxx; Giuseppe
> Cavallaro <peppe.cavallaro@xxxxxx>; Alexandre Torgue
> <alexandre.torgue@xxxxxxxxxxx>; David S. Miller <davem@xxxxxxxxxxxxx>;
> Jakub Kicinski <kuba@xxxxxxxxxx>; Palmer Dabbelt <palmer@xxxxxxxxxxx>;
> Paul Walmsley <paul.walmsley@xxxxxxxxxx>; Drew Fustini
> <drew@xxxxxxxxxxxxxxx>; Emil Renner Berthing <kernel@xxxxxxxx>; Jon
> Hunter <jonathanh@xxxxxxxxxx>; Will Deacon <will@xxxxxxxxxx>
> Subject: Re: [PATCH net-next] stmmac: align RX buffers
>
> On Tue, Aug 10, 2021 at 08:07:47PM +0100, Marc Zyngier wrote:
> > Hi all,
> >
> > [adding Thierry, Jon and Will to the fun]
> >
> > On Mon, 14 Jun 2021 03:25:04 +0100,
> > Matteo Croce <mcroce@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > From: Matteo Croce <mcroce@xxxxxxxxxxxxx>
> > >
> > > On RX an SKB is allocated and the received buffer is copied into it.
> > > But on some architectures, the memcpy() needs the source and
> > > destination buffers to have the same alignment to be efficient.
> > >
> > > This is not the case here, because the SKB data pointer is misaligned
> > > by two bytes to compensate for the Ethernet header.
> > >
> > > Align the RX buffer the same way as the SKB one, so the copy is faster.
> > > An iperf3 RX test gives a decent improvement on a RISC-V machine:
> > >
> > > before:
> > > [ ID] Interval        Transfer     Bitrate        Retr
> > > [  5] 0.00-10.00 sec  733 MBytes   615 Mbits/sec  88    sender
> > > [  5] 0.00-10.01 sec  730 MBytes   612 Mbits/sec        receiver
> > >
> > > after:
> > > [ ID] Interval        Transfer     Bitrate        Retr
> > > [  5] 0.00-10.00 sec  1.10 GBytes  942 Mbits/sec  0     sender
> > > [  5] 0.00-10.00 sec  1.09 GBytes  940 Mbits/sec        receiver
> > >
> > > And the memcpy() overhead during the RX drops dramatically.
> > >
> > > before:
> > > Overhead Shared O Symbol
> > > 43.35% [kernel] [k] memcpy
> > > 33.77% [kernel] [k] __asm_copy_to_user
> > > 3.64% [kernel] [k] sifive_l2_flush64_range
> > >
> > > after:
> > > Overhead Shared O Symbol
> > > 45.40% [kernel] [k] __asm_copy_to_user
> > > 28.09% [kernel] [k] memcpy
> > > 4.27% [kernel] [k] sifive_l2_flush64_range
> > >
> > > Signed-off-by: Matteo Croce <mcroce@xxxxxxxxxxxxx>
> >
> > This patch completely breaks my Jetson TX2 system, composed of 2
> > Nvidia Denver and 4 Cortex-A57, in a very "funny" way.
> >
> > Any significant amount of traffic results in all sorts of corruption
> > (ssh connections get dropped, Debian packages downloaded have the
> > wrong checksums) if any Denver core is involved in any significant way
> > (packet processing, interrupt handling). And it is all triggered by
> > this very change.
> >
> > The only way I have to make it work on a Denver core is to route the
> > interrupt to that particular core and taskset the workload to it. Any
> > other configuration involving a Denver CPU results in some sort of
> > corruption. On their own, the A57s are fine.
> >
> > This smells of memory ordering going really wrong, which this change
> > would expose. I haven't had a chance to dig into the driver yet (it
> > took me long enough to bisect it), but if someone points me at what is
> > supposed to synchronise the DMA when receiving an interrupt, I'll have
> > a look.
>
> One other thing that kind of rings a bell when reading DMA and interrupts is a
> recent report (and attempt to fix this) where upon resume from system
> suspend, the DMA descriptors would get corrupted.
>
> I don't think we ever figured out what exactly the problem was, but
> interestingly the fix for the issue immediately caused things to go haywire on...
> Jetson TX2.
>
> I recall looking at this a bit and couldn't find where exactly the DMA was being
> synchronized on suspend/resume, or what the mechanism was to ensure that
> (in transit) packets were not received after the suspension of the Ethernet
> device. Some information about this can be found here:
>
> https://lore.kernel.org/netdev/708edb92-a5df-ecc4-3126-5ab36707e275@nvidia.com/
>
> It's interesting that this happens only on Jetson TX2. Apparently on the newer
> Jetson AGX Xavier this problem does not occur. I think Jon also narrowed this
> down to being related to the IOMMU being enabled on Jetson TX2, whereas
> Jetson AGX Xavier didn't have it enabled. I wasn't able to find any notes on
> whether disabling the IOMMU on Jetson TX2 did anything to improve on this,
> so perhaps that's something worth trying.
>
> We have since enabled the IOMMU on Jetson AGX Xavier, and I haven't seen
> any test reports indicating that this is causing issues. So I don't think this has
> anything directly to do with the IOMMU support.
>
> That said, if these problems are all exclusive to Jetson TX2, or rather Tegra186,
> that could indicate that we're missing something at a more fundamental level
> (maybe some cache maintenance quirk?).


Hey Thierry,

Please also let me know if you find the root cause; that would be appreciated!
I have not upstreamed the fix you mentioned yet because of your continued NACK.

Thanks in advance 😊

Best Regards,
Joakim Zhang
> Thierry
>
> > > ---
> > > drivers/net/ethernet/stmicro/stmmac/stmmac.h | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> > > index b6cd43eda7ac..04bdb3950d63 100644
> > > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> > > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> > > @@ -338,9 +338,9 @@ static inline bool stmmac_xdp_is_enabled(struct stmmac_priv *priv)
> > >  static inline unsigned int stmmac_rx_offset(struct stmmac_priv *priv)
> > >  {
> > >  	if (stmmac_xdp_is_enabled(priv))
> > > -		return XDP_PACKET_HEADROOM;
> > > +		return XDP_PACKET_HEADROOM + NET_IP_ALIGN;
> > >
> > > -	return 0;
> > > +	return NET_SKB_PAD + NET_IP_ALIGN;
> > >  }
> > >
> > > void stmmac_disable_rx_queue(struct stmmac_priv *priv, u32 queue);
> > > --
> > > 2.31.1
> > >
> > >
> >
> > --
> > Without deviation from the norm, progress is not possible.
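
The alignment argument in the quoted commit message can be illustrated with
a little offset arithmetic. The following is only a userspace sketch, not
driver code: the DEMO_* constants and helper names are hypothetical, and the
values chosen (NET_IP_ALIGN = 2, NET_SKB_PAD = 64, 8-byte words) are
assumptions, since the real ones depend on architecture and configuration.
It also assumes the SKB data pointer sits NET_SKB_PAD + NET_IP_ALIGN bytes
into its buffer and that both buffers start word-aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for illustration; the kernel's are arch/config dependent. */
#define DEMO_NET_IP_ALIGN 2u
#define DEMO_NET_SKB_PAD  64u
#define DEMO_WORD_SIZE    8u

/* Position of an offset within a machine word (word must be a power of two). */
static inline unsigned int misalignment(size_t offset, unsigned int word)
{
	assert(word != 0 && (word & (word - 1)) == 0);
	return (unsigned int)(offset & (word - 1));
}

/* A word-wise copy can stay on its fast path when both sides share the
 * same misalignment, so this is the property the patch is after. */
static inline bool copy_is_coaligned(size_t src_off, size_t dst_off,
				     unsigned int word)
{
	return misalignment(src_off, word) == misalignment(dst_off, word);
}

/* RX buffer offset before the patch (non-XDP path returned 0). */
static inline size_t demo_rx_offset_before(void)
{
	return 0;
}

/* RX buffer offset after the patch (non-XDP path). */
static inline size_t demo_rx_offset_after(void)
{
	return DEMO_NET_SKB_PAD + DEMO_NET_IP_ALIGN;
}

/* Assumed offset of the SKB data pointer, i.e. the copy destination. */
static inline size_t demo_skb_data_offset(void)
{
	return DEMO_NET_SKB_PAD + DEMO_NET_IP_ALIGN;
}
```

Under these assumptions, the copy source moves from a word-aligned offset to
one that is two bytes into a word, matching the destination. The sketch says
nothing about the corruption seen on Jetson TX2; it only shows why the copy
becomes cheaper.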