Re: [PATCH] xsk: Use pool->dma_pages to check for DMA

From: Kal Cutter Conley
Date: Tue Apr 25 2023 - 06:38:09 EST


> Okay, 2-3% but with what settings? rxdrop for unaligned mode? what chunk
> size etc? We need this kind of info, "compiles to more efficient code"
> from original commit message is too generic and speculative to me. 2-3% of
> perf diff against specific xdpsock setup is real improvement and is a
> strong argument for getting this patch as-is, by its own.

I don't have the exact numbers anymore. I measured a performance
difference (up to 2-3%) with different settings including rxdrop and
unaligned mode. The exact settings you have in the commit message from
the linked patch. I didn't go into details in the commit message
because I thought this change would be a slam dunk. I don't think
there is any reason to believe the code is slower with this patch. If
anything, it should generally be faster. At the very least it will
lead to more efficient code in terms of size since dma_pages_cnt is no
longer read. Also I think the code is more readable with this patch.

>
> >
> > > > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > > > index d318c769b445..a8d7b8a3688a 100644
> > > > --- a/include/net/xsk_buff_pool.h
> > > > +++ b/include/net/xsk_buff_pool.h
> > > > @@ -180,7 +180,7 @@ static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
> > > > if (likely(!cross_pg))
> > > > return false;
> > > >
> > > > - return pool->dma_pages_cnt &&
> > > > + return pool->dma_pages &&
> > > > !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
> > > > }
> >
> > I would consider the above code part of the "fast path". It may be
> > executed approximately once per frame in unaligned mode.
> >
> > > This seems to be used in the setup/tear-down paths so your optimizing
> > > a control side. Is there a fast path with this code? I walked the
> > > ice driver. If its just setup code we should do whatever is more
> > > readable.
> >
> > It is not only used in setup/tear-down paths (see above).
> > Additionally, I believe the code is also _more_ readable with this
> > patch applied. In particular, this patch reduces cognitive complexity
> > since people (and compilers) reading the code don't need to
> > additionally think about pool->dma_pages_cnt.
>
> John was referring to xp_dma_unmap() with the comment above which indeed
> is a teardown path, so probably this doesn't matter from performance
> perspective and you could avoid this chunk from your patch.

The setup/tear-down lines were also changed to keep the code
consistent. It doesn't make sense to sometimes check dma_pages_cnt and
other times dma_pages.

>
> >
> > > Both the _alloc_ cases read neighboring free_heads_cnt so your saving a load I guess?
> > > This is so deep into micro-optimizing I'm curious if you could measure it?
> >
> > It is saving a load which also reduces code size. This will affect
> > other decisions such as what to inline. Also in the linked patchset,
> > dma_pages and dma_pages_cnt do not share a cache line (on x86_64).
>
> Yes I believe that the with your patch on unaligned mode by touching the
> dma_pages you're warming the relevant cache line for your setup.

dma_pages is touched anyway right after. That is the point of this
patch: since dma_pages already needs to be loaded into a register,
just check that instead of loading an additional field possibly from a
different cache line.