Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch

From: Alexander Lobakin
Date: Fri Jun 02 2023 - 10:00:10 EST


From: Alexander Duyck <alexander.duyck@xxxxxxxxx>
Date: Wed, 31 May 2023 10:22:18 -0700

> On Wed, May 31, 2023 at 8:14 AM Alexander Lobakin
> <aleksander.lobakin@xxxxxxxxx> wrote:

[...]

>> But not all of these variables are read-only. E.g. NTC is often
>> modified. Page size was calculated per descriptor, but could be once a
>> poll cycle starts, and so on.
>
> Yeah, the ntc should be carried in the stack. The only reason for
> using the ring variable was because in the case of ixgbe we had to do
> some tricks with it to deal with RSC as we were either accessing ntc
> or the buffer pointed to by the descriptor. I think most of that code
> has been removed for i40e though.

IAVF was forked off ixgbe as per Jesse's statement :D

[...]

>>> Any specific reason for this? Just wondering if this is meant to
>>> address some sort of memory pressure issue since it basically just
>>> means the allocation can go out and try to free other memory.
>>
>> Yes, I'm no MM expert, but I've seen plenty of times messages from the
>> MM folks that ATOMIC shouldn't be used in non-atomic contexts. Atomic
>> allocation is able to grab memory from some sort of critical reservs and
>> all that, and the less we touch them, the better. Outside of atomic
>> contexts they should not be touched.
>
> For our purposes though the Rx path is more-or-less always in
> interrupt context. That is why it had defaulted to just always using
> GFP_ATOMIC. For your purposes you could probably leave it that way
> since you are going to be pulling out most of this code anyway.

That's for Rx path, but don't forget that the initial allocation on ifup
is done in the process context. That's what the maintainers and
reviewers usually warn about: to not allocate with %GFP_ATOMIC on ifups.

[...]

>> The point of budget is to limit the amount of time drivers can spend on
>> cleaning their rings. Making skb the unit makes the unit very logical
>> and flexible, but I'd say it should always be solid. Imagine you get a
>> frame which got spanned across 5 buffers. You spend x5 time (roughly) to
>> build an skb and pass it up the stack vs when you get a linear frame in
>> one buffer, but according to your logics both of these cases count as 1
>> unit, while the amount of time spent differs significantly. I can't say
>> that's fair enough.
>
> I would say it is. Like I said most of the overhead is the stack, not
> the driver. So if we are cleaning 5 descriptors but only processing
> one skb then I would say it is only one unit in terms of budget. This
> is one of the reasons why we don't charge Tx to the NAPI budget. Tx
> clean up is extremely lightweight as it is only freeing memory, and in
> cases of Tx and Rx being mixed can essentially be folded in as Tx
> buffers could be reused for Rx.
>
> If we are wanting to increase the work being done per poll it would
> make more sense to stick to interrupts and force it to backlog more
> packets per interrupt so that it is processing 64 skbs per call.

Oh, I feel like I'm starting to agree :D OK, then the following doesn't
really get out of my head: why do we store skb pointer on the ring then,
if we count 1 skb as 1 unit, so that we won't leave the loop until the
EOP? Only to handle allocation failures? But skb is already allocated at
this point... <confused>

[...]

>>> What is the test you saw the 2% performance improvement in? Is it
>>> something XDP related or a full stack test?
>>
>> Not XDP, it's not present in this driver at this point :D
>> Stack test, but without usercopy overhead. Trafgen bombs the NIC, the
>> driver builds skbs and passes it up the stack, the stack does GRO etc,
>> and then the frames get dropped on IP input because there's no socket.
>
> So one thing you might want to look at would be a full stack test w/
> something such as netperf versus optimizing for a drop only test.
> Otherwise that can lead to optimizations that will actually hurt
> driver performance in the long run.

I was doing some netperf (or that Microsoft's tool, don't remember the
name) tests, but the problem is that usercopy is such a bottleneck, so
that you don't notice any optimizations or regressions most of time.
Also, userspace tools usually just pass huge payload chunks and then the
drivers GSO them into MTU-sized frames, so you always get line rate and
that's it. Short frames or interleave/imix (randomly-mix-sized) are the
most stressful from my experience and are able to show actual outcome.

>
>>>
>>>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
>>>
>>> Also one thing I am not a huge fan of is a patch that is really a
>>> patchset onto itself. With all 6 items called out here I would have
>>> preferred to see this as 6 patches as it would have been easier to
>>> review.
>>
>> Agree BTW, I'm not a fan of this patch either. I wasn't sure what to do
>> with it, as splitting it into 6 explodes the series into a monster, but
>> proceeding without it increases diffstat and complicates things later
>> on. I'll try the latter, but will see. 17 patches is not the End of Days
>> after all.
>
> One thing you may want to consider to condense some of these patches
> would be to look at possibly combining patches 4 and 5 which disable
> recycling and use a full 4K page. It seems like of those patches one
> ends up redoing the other since so many of the dma_sync calls are
> updated in both.

Or maybe I'll move this one into the subsequent series, since it's only
pt. 1 of Rx optimizations. There's also the second commit, but it's
probably as messy as this one and these two could be just converted into
a series.

[...]

>>> Just a nit. You might want to break this up into two statements like I
>>> had before. I know some people within Intel weren't a huge fan of when
>>> I used to do that kind of thing all the time in loops where I would do
>>> the decrement and test in one line.. :)
>>
>> Should I please them or do it as I want to? :D I realize from the
>> compiler's PoV it's most likely the same, but dunno, why not.
>
> If nobody internally is bugging you about it then I am fine with it. I
> just know back during my era people would complain about that from a
> maintainability perspective. I guess I got trained to catch those kind
> of things as a result.

Haha understand. I usually say: "please some good arguments or I didn't
hear this", maybe that's why nobody complained on `--var` yet :D

[...]

>> Yes, I'm optimizing all this out later in the series. I was surprised
>> just as much as you when I saw skb getting passed to do nothing ._.
>
> The funny part for me is that it is like reviewing code written via a
> game of telephone. I recognize the code but have to think about it
> since there are all the bits of changes and such from the original
> ixgbe.

Lots of things are still recognizable even in IDPF. That's how this
series was born... :D

>
>> [...]
>>
>> Thanks for the detailed reviews, stuff that Intel often lacks :s :D
>
> No problem, it was the least I could do since I am responsible for so
> much of this code in the earlier drivers anyway. If nothing else I
> figured I could provide a bit of history on why some of this was the
> way it was.
These history bits are nice and interesting to read actually! And also
useful since they give some context and understanding of what is
obsolete and can be removed/changed.

Thanks,
Olek