Re: [PATCH] xsk: Use pool->dma_pages to check for DMA

From: Kal Cutter Conley
Date: Wed Apr 26 2023 - 08:57:35 EST

Next message: Andrew Jones: "Re: [PATCH 4/4] riscv: Enable perf counters user access only through perf"
Previous message: Greg Kroah-Hartman: "Re: [PATCH v2 0/7] Allow dynamic allocation of software IO TLB bounce buffers"
In reply to: John Fastabend: "Re: [PATCH] xsk: Use pool->dma_pages to check for DMA"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

> > > Was it noticable in some sort of performance test?
> >
> > This patch is part of the patchset found at
> > https://lore.kernel.org/all/20230412162114.19389-3-kal.conley@xxxxxxxxxxx/
> > which is being actively discussed and needs to be resubmitted anyway
> > because of a conflict. While the discussion continues, I am submitting
> > this patch by itself because I think it's an improvement on its own
> > (regardless of what happens with the rest of the linked patchset). On
> > one system, I measured a performance regression of 2-3% with xdpsock
> > and the linked changes without the current patch. With the current
> > patch, the performance regression was no longer observed.
>
> Would be nice to have in commit message so reader has an idea the
> perf numbers are in fact better.

When I measured this patch by itself (on bpf-next), I didn't measure
any statistically significant performance gains. However, it did allow
me to avoid a regression when combined with the other linked patch (as
mentioned). I don't know if it makes sense to mention that other
change which is not even applied to any tree. I was mainly submitting
this patch from the perspective of the code being better not
contingent on any provable performance gains.

>
> >
> > > > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > > > index d318c769b445..a8d7b8a3688a 100644
> > > > --- a/include/net/xsk_buff_pool.h
> > > > +++ b/include/net/xsk_buff_pool.h
> > > > @@ -180,7 +180,7 @@ static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
> > > > if (likely(!cross_pg))
> > > > return false;
> > > >
> > > > - return pool->dma_pages_cnt &&
> > > > + return pool->dma_pages &&
> > > > !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
> > > > }
> >
> > I would consider the above code part of the "fast path". It may be
> > executed approximately once per frame in unaligned mode.
>
> In the unlikely case though is my reading. So really shouldn't
> be called for every packet or we have other perf issues by that
> likely() there.
>
> I assume the above is where the perf is being gained because below
> two things are in setup/tear down. But then we are benchmarking
> an unlikely() path?

I was testing with large chunk sizes in unaligned mode (4000-4096
bytes) with ZC. For chunk sizes nearly as large as PAGE_SIZE the
unlikely path is actually the main path.

> >
> > > This seems to be used in the setup/tear-down paths so your optimizing
> > > a control side. Is there a fast path with this code? I walked the
> > > ice driver. If its just setup code we should do whatever is more
> > > readable.
> >
> > It is not only used in setup/tear-down paths (see above).
> > Additionally, I believe the code is also _more_ readable with this
> > patch applied. In particular, this patch reduces cognitive complexity
> > since people (and compilers) reading the code don't need to
> > additionally think about pool->dma_pages_cnt.
> >
> > > Both the _alloc_ cases read neighboring free_heads_cnt so your saving a load I guess?
> > > This is so deep into micro-optimizing I'm curious if you could measure it?
> >
> > It is saving a load which also reduces code size. This will affect
> > other decisions such as what to inline. Also in the linked patchset,
> > dma_pages and dma_pages_cnt do not share a cache line (on x86_64).
>
> But again buried in an unlikely path. Sure but removing the conditional
> altogether would be even better.

Yeah, I think that is another improvement to consider.

> So my understanding is ZC is preferred and default mode and copy modes
> are primarily fall back modes. So we are punishing the good case here
> for a fallback to copy mode. I think overall refactoring the code to
> avoid burdoning the fast case with a fallback slow case would be ideal
> solution.

I agree that ZC is preferred and this patch is aimed at improving the
ZC path. The performance gain I observed was for ZC.

> However, I agree just on readability the patch is fine and good. No
> objection on my side. But I think if we are making performance
> arguments for 2-3% here the better thing to do is remove the check
> and unlikely() and we would see better benchmarks when using the
> ZC mode which as I understand it is what performance aware folks should
> be doing.

I totally agree that other better improvements exist but I don't think
they make this patch any less desirable. This change is only meant as
a small incremental improvement.

Next message: Andrew Jones: "Re: [PATCH 4/4] riscv: Enable perf counters user access only through perf"
Previous message: Greg Kroah-Hartman: "Re: [PATCH v2 0/7] Allow dynamic allocation of software IO TLB bounce buffers"
In reply to: John Fastabend: "Re: [PATCH] xsk: Use pool->dma_pages to check for DMA"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]