Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

From: Alexander Duyck
Date: Fri Jun 02 2023 - 13:50:44 EST


On Fri, Jun 2, 2023 at 9:16 AM Alexander Lobakin
<aleksander.lobakin@xxxxxxxxx> wrote:
>
> From: Alexander Duyck <alexander.duyck@xxxxxxxxx>
> Date: Fri, 2 Jun 2023 08:04:53 -0700
>
> > On Fri, Jun 2, 2023 at 7:00 AM Alexander Lobakin
> > <aleksander.lobakin@xxxxxxxxx> wrote:
>
> [...]
>
> >> That's for Rx path, but don't forget that the initial allocation on ifup
> >> is done in the process context. That's what the maintainers and
> >> reviewers usually warn about: to not allocate with %GFP_ATOMIC on ifups.
> >
> > I can see that for the static values like the queue vectors and rings,
> > however for the buffers themselves, but I don't see the point in doing
> > that for the regular buffer allocations. Basically it is adding
> > overhead for something that should have minimal impact as it usually
> > happens early on during boot when the memory should be free anyway so
> > GFP_ATOMIC vs GFP_KERNEL wouldn't have much impact in either case
>
> Queue vectors and rings get allocated earlier than buffers, on device
> probing :D ifup happens later and it depends on the networking scripts
> etc. -- now every init system enables all the interfaces when booting up
> like systemd does. Plus, ifdowns-ifups can occur during the normal
> system functioning -- resets, XDP setup/remove, Ethtool configuration,
> and so on. I wouldn't say Rx buffer allocation happens only on early boot.

I agree it isn't only on early boot, but that is the most common case.
The rings and such tend to get allocated early and unless there is a
need for some unforeseen change the rings typically are not modified.

I just don't see the point of special casing the allocations since if
they fail we will be turning around and just immediately calling the
GFP_ATOMIC version within 2 seconds anyway to try and fill out the
empty rings.

> >> Oh, I feel like I'm starting to agree :D OK, then the following doesn't
> >> really get out of my head: why do we store skb pointer on the ring then,
> >> if we count 1 skb as 1 unit, so that we won't leave the loop until the
> >> EOP? Only to handle allocation failures? But skb is already allocated at
> >> this point... <confused>
> >
> > The skb is there to essentially hold the frags. Keep in mind that when
> > ixgbe was coded up XDP didn't exist yet.
>
> Ok, maybe I phrased it badly.
> If we don't stop the loop until skb is passed up the stack, how we can
> go out of the loop with an unfinished skb? Previously, I thought lots of
> drivers do that, as you may exhaust your budget prior to reaching the
> last fragment, so you'll get back to the skb on the next poll.
> But if we count 1 skb as budget unit, not descriptor, how we can end up
> breaking the loop prior to finishing the skb? I can imagine only one
> situation: HW gave us some buffers, but still processes the EOP buffer,
> so we don't have any more descriptors to process, but the skb is still
> unfinished. But sounds weird TBH, I thought HW processes frames
> "atomically", i.e. it doesn't give you buffers until they hold the whole
> frame :D

The problem is the frames aren't necessarily written back atomically.
One big issue is descriptor write back. The hardware will try to cache
line optimize things in order to improve performance. It is possible
for a single frame to straddle either side of a cache line. As a
result the first half may be written back, the driver then processes
that cache line, and finds the next one isn't populated while the
hardware is collecting enough descriptors to write back the next one.

It is also one of the reasons why I went to so much effort to prevent
us from writing to the descriptor ring in the cleanup paths. You never
know when you might be processing an earlier frame and accidently
wander into a section that is in the process of being written. I think
that is addressed now mostly through the use of completion queues
instead of the single ring that used to process both work and
completions.

> >
> > I think there are drivers that are already getting away from this,
> > such as mvneta, by storing an xdp_buff instead of an skb. In theory we
> > could do away with most of this and just use a shared_info structure,
> > but since that exists in the first frag we still need a pointer to the
> > first frag as well.
>
> ice has xdp_buff on the ring for XDP multi-buffer. It's more lightweight
> than skb, but also carries the frags, since frags is a part of shinfo,
> not skb.
> It's totally fine and we'll end up doing the same here, my question was
> as I explained below.

Okay. I haven't looked at ice that closely so I wasn't aware of that.

> >
> > Also multi-frag frames are typically not that likely on a normal
> > network as most of the frames are less than 1514B in length. In
> > addition as I mentioned before a jumbo frame workload will be less
> > demanding since the frame rates are so much lower. So when I coded
> > this up I had optimized for the non-fragged case with the fragmented
> > case being more of an afterthought needed mostly as exception
> > handling.
>
> [...]
>
> > That is kind of what I figured. So one thing to watch out for is
> > stating performance improvements without providing context on what
> > exactly it is you are doing to see that gain. So essentially what we
> > have is a microbenchmark that is seeing the gain.
> >
> > Admittedly my goto used to be IPv4 routing since that exercised both
> > the Tx and Rx path for much the same reason. However one thing you
> > need to keep in mind is that if you cannot see a gain in the
> > full-stack test odds are most users may not notice much of an impact.
>
> Yeah sure. I think more than a half of optimizations in such drivers
> nowadays is unnoticeable to end users :D
>
> [...]
>
> > Either that or they were already worn down by the time you started
> > adding this type of stuff.. :)
> >
> > The one I used to do that would really drive people nuts was:
> > for (i = loop_count; i--;)
>
> Oh, nice one! Never thought of something like that hehe.
>
> >
> > It is more efficient since I don't have to do the comparison to the
> > loop counter, but it is definitely counterintuitive to run loops
> > backwards like that. I tried to break myself of the habit of using
> > those sort of loops anywhere that wasn't performance critical such as
> > driver init.
>
> [...]
>
> > Yep, now the question is how many drivers can be pulled into using
> > this library. The issue is going to be all the extra features and
> > workarounds outside of your basic Tx/Rx will complicate the code since
> > all the drivers implement them a bit differently. One of the reasons
> > for not consolidating them was to allow for performance optimizing for
> > each driver. By combining them you are going to likely need to add a
> > number of new conditional paths to the fast path.
>
> When I was counting the number of spots in the Rx polling function that
> need to have switch-cases/ifs in order to be able to merge the code
> (e.g. parsing the descriptors), it was something around 4-5 (per
> packet). So it can only be figured out during the testing whether adding
> new branches actually hurts there.

The other thing is you may want to double check CPU(s) you are
expected to support as last I knew switch statements were still
expensive due to all the old spectre/meltdown workarounds.

> XDP is relatively easy to unify in one place, most of code is
> software-only. Ring structures are also easy to merge, wrapping a couple
> driver-specific pointers into static inline accessors. Hotpath in
> general is the easiest part, that's why I started from it.
>
> But anyway, I'd say if one day I'd have to choice whether to remove 400
> locs per driver with having -1% in synthetic tests or not do that at
> all, I'd go for the former. As discussed above, it's very unlikely for
> such changes to hurt real workloads, esp. with usercopy.

+1 assuming no serious regressions.

> >
> >
> >>>
> >>>> [...]
> >>>>
> >>>> Thanks for the detailed reviews, stuff that Intel often lacks :s :D
> >>>
> >>> No problem, it was the least I could do since I am responsible for so
> >>> much of this code in the earlier drivers anyway. If nothing else I
> >>> figured I could provide a bit of history on why some of this was the
> >>> way it was.
> >> These history bits are nice and interesting to read actually! And also
> >> useful since they give some context and understanding of what is
> >> obsolete and can be removed/changed.
> >
> > Yeah, it is easiest to do these sort of refactors when you have
> > somebody to answer the "why" of most of this. I recall going through
> > this when I was refactoring the igb/ixgbe drivers back in the day and
> > having to purge the dead e1000 code throughout. Of course, after this
> > refactor it will be all yours right?.. :D
>
> Nope, maybe from the git-blame PoV only :p
>
> Thanks,
> Olek