Re: [PATCH] xsk: Use pool->dma_pages to check for DMA

From: Kal Cutter Conley
Date: Mon Apr 24 2023 - 18:53:21 EST


> > Compare pool->dma_pages instead of pool->dma_pages_cnt to check for an
> > active DMA mapping. pool->dma_pages needs to be read anyway to access
> > the map so this compiles to more efficient code.
>
> Was it noticable in some sort of performance test?

This patch is part of the patchset found at
https://lore.kernel.org/all/20230412162114.19389-3-kal.conley@xxxxxxxxxxx/
which is being actively discussed and needs to be resubmitted anyway
because of a conflict. While the discussion continues, I am submitting
this patch by itself because I think it's an improvement on its own
(regardless of what happens with the rest of the linked patchset). On
one system, I measured a performance regression of 2-3% with xdpsock
and the linked changes without the current patch. With the current
patch, the performance regression was no longer observed.

> > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > index d318c769b445..a8d7b8a3688a 100644
> > --- a/include/net/xsk_buff_pool.h
> > +++ b/include/net/xsk_buff_pool.h
> > @@ -180,7 +180,7 @@ static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
> > if (likely(!cross_pg))
> > return false;
> >
> > - return pool->dma_pages_cnt &&
> > + return pool->dma_pages &&
> > !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
> > }

I would consider the above code part of the "fast path". It may be
executed approximately once per frame in unaligned mode.

> This seems to be used in the setup/tear-down paths so your optimizing
> a control side. Is there a fast path with this code? I walked the
> ice driver. If its just setup code we should do whatever is more
> readable.

It is not only used in setup/tear-down paths (see above).
Additionally, I believe the code is also _more_ readable with this
patch applied. In particular, this patch reduces cognitive complexity
since people (and compilers) reading the code don't need to
additionally think about pool->dma_pages_cnt.

> Both the _alloc_ cases read neighboring free_heads_cnt so your saving a load I guess?
> This is so deep into micro-optimizing I'm curious if you could measure it?

It is saving a load which also reduces code size. This will affect
other decisions such as what to inline. Also in the linked patchset,
dma_pages and dma_pages_cnt do not share a cache line (on x86_64).

>
> > } else {
> > xskb = &pool->heads[xp_aligned_extract_idx(pool, addr)];
>
> I'm not actually against optimizing but maybe another idea. Why do we have to
> check at all? Seems if the DMA has been disabled/unmapped the driver shouldn't
> be trying to call xsk_buff_alloc_batch? Then you can just drop the 'if' check.
>
> It feels to me the drivers shouldn't even be calling this after unmapping
> the dma. WDYT?

Many of these code paths are used both for ZC and copy modes. You
might be right that this particular case is only used with DMA.