Re: [EXT] Re: [PATCH 1/1] net: fec: add initial XDP support

From: Jesper Dangaard Brouer
Date: Tue Oct 04 2022 - 07:22:02 EST



On 03/10/2022 14.49, Shenwei Wang wrote:
Hi Jesper,

On the mvneta driver/platform we saw a huge speedup replacing:

page_pool_release_page(rxq->page_pool, page); with
skb_mark_for_recycle(skb);


After replacing page_pool_release_page with skb_mark_for_recycle, I
found something in the testing results that confused me a little.
I tested with the "xdpsock" sample app under two modes: 1. Native (xdpsock -i eth0). 2. Skb-mode (xdpsock -S -i eth0).
Great that you are also testing AF_XDP, but do you have a particular
use-case that needs AF_XDP on this board?

What packet size is used in the results below?

The following are the testing results:

              With page_pool_release_page (pps)   With skb_mark_for_recycle (pps)
SKB-Mode      90K                                 200K
Native        190K                                190K


The default AF_XDP test with xdpsock is rxdrop IIRC.

Can you test the normal XDP code path and do an XDP_DROP test via the
samples tool 'xdp_rxq_info' and this cmdline:

sudo ./xdp_rxq_info --dev eth42 --act XDP_DROP --read

And then the same with --skb-mode.

The skb_mark_for_recycle solution boosted the performance of SKB-Mode
to 200K+ PPS. That is even higher than the performance of the Native
solution. Is this result reasonable? Do you have any clue why the
SKB-Mode performance can go higher than that of the Native one?
I might be able to explain this (Cc. AF_XDP maintainers to keep me honest).

When you say "native" *AF_XDP*, note that this isn't Zero-Copy AF_XDP.

Sure, XDP runs in native driver mode and redirects the raw frames into the AF_XDP socket, but this isn't zero-copy AF_XDP. Thus, the packets need to be copied into the AF_XDP buffers.

As soon as the frame or SKB (for generic XDP) has been copied, it is released/freed by the AF_XDP/xsk code (either via xdp_return_buff() or consume_skb()). Thus, it looks like it really pays off to recycle the frame via page_pool, also for the SKB consume_skb() case.

I am still a little surprised that it can be faster than native AF_XDP, as SKB-mode ("XDP-generic") needs to call through a lot more software layers and convert the SKB to look like an xdp_buff.

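Roughly, the gain comes from the SKB free path: skb_mark_for_recycle()
sets the SKB's pp_recycle bit, so freeing hands the page back to its
page_pool instead of the page allocator. A paraphrased sketch of that
decision (not verbatim kernel source; see skb_free_head() and
skb_pp_recycle() in net/core/skbuff.c for the real code):

/* Paraphrased sketch of the recycle check when an SKB head is freed. */
static bool skb_try_pp_recycle(struct sk_buff *skb, void *head)
{
        if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
                return false;   /* normal page-allocator free path */

        /* Return the page to its originating page_pool; the DMA
         * mapping is kept, so the driver can refill the RX ring
         * without a new dma_map_page(). */
        return page_pool_return_skb_page(virt_to_page(head));
}
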
--Jesper



-----Original Message-----
From: Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx>
Sent: Thursday, September 29, 2022 1:55 PM
To: Shenwei Wang <shenwei.wang@xxxxxxx>; Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx>; Andrew Lunn <andrew@xxxxxxx>
Cc: brouer@xxxxxxxxxx; Joakim Zhang <qiangqing.zhang@xxxxxxx>; David S.
Miller <davem@xxxxxxxxxxxxx>; Eric Dumazet <edumazet@xxxxxxxxxx>; Jakub
Kicinski <kuba@xxxxxxxxxx>; Paolo Abeni <pabeni@xxxxxxxxxx>; Alexei
Starovoitov <ast@xxxxxxxxxx>; Daniel Borkmann <daniel@xxxxxxxxxxxxx>;
Jesper Dangaard Brouer <hawk@xxxxxxxxxx>; John Fastabend
<john.fastabend@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx; linux-
kernel@xxxxxxxxxxxxxxx; imx@xxxxxxxxxxxxxxx
Subject: Re: [EXT] Re: [PATCH 1/1] net: fec: add initial XDP support

Caution: EXT Email

On 29/09/2022 17.52, Shenwei Wang wrote:

From: Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx>

On 29/09/2022 15.26, Shenwei Wang wrote:

From: Andrew Lunn <andrew@xxxxxxx>
Sent: Thursday, September 29, 2022 8:23 AM
[...]

I actually did some comparison testing of the page pool for normal
traffic. So far I don't see a significant improvement in the current
implementation. The performance for large packets improves a little,
and the performance for small packets gets a little worse.

What hardware was this for? imx51? imx6? imx7 Vybrid? These all use the
FEC.

I tested on the imx8qxp platform. It is ARM64.

On the mvneta driver/platform we saw a huge speedup replacing:

page_pool_release_page(rxq->page_pool, page); with
skb_mark_for_recycle(skb);

As I mentioned: today page_pool has SKB recycle support (you might
have looked at drivers that didn't utilize this yet), thus you don't
need to release the page (page_pool_release_page) here. Instead you
could simply mark the SKB for recycling, unless the driver does some
page refcnt tricks I didn't notice.

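As a rough sketch of what I mean (hypothetical code, not the actual
fec driver; the struct/field names below are assumptions based on the
patch):

/* Hypothetical RX-path sketch: build the SKB on top of the page_pool
 * page and mark it for recycling, instead of releasing the page. */
static void fec_rx_deliver_skb(struct fec_enet_private *fep,
                               struct fec_enet_priv_rx_q *rxq,
                               struct page *page, int pkt_len, int headroom)
{
        struct sk_buff *skb;

        skb = build_skb(page_address(page), PAGE_SIZE);
        if (unlikely(!skb)) {
                page_pool_recycle_direct(rxq->page_pool, page);
                return;
        }
        skb_reserve(skb, headroom);
        skb_put(skb, pkt_len);

        /* Old approach (the expensive DMA unmap happens here):
         *      page_pool_release_page(rxq->page_pool, page);
         *
         * New approach: keep page_pool ownership; consume_skb() /
         * kfree_skb() will return the page to the pool with the DMA
         * mapping still intact. */
        skb_mark_for_recycle(skb);

        napi_gro_receive(&fep->napi, skb);
}
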
On the mvneta driver/platform the DMA unmap (in
page_pool_release_page) was very expensive. This imx8qxp platform
might have a faster DMA unmap in case it is cache-coherent.

I would be very interested in knowing if skb_mark_for_recycle() helps
on this platform, for normal network stack performance.


I did a quick comparison test for the following 3 scenarios:

Thanks for doing this! :-)

1. original implementation

shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
------------------------------------------------------------
Client connecting to 10.81.16.245, TCP port 5001
TCP window size: 416 KByte (WARNING: requested 1.91 MByte)
------------------------------------------------------------
[ 1] local 10.81.17.20 port 49154 connected with 10.81.16.245 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-1.0000 sec 104 MBytes 868 Mbits/sec
[ 1] 1.0000-2.0000 sec 105 MBytes 878 Mbits/sec
[ 1] 2.0000-3.0000 sec 105 MBytes 881 Mbits/sec
[ 1] 3.0000-4.0000 sec 105 MBytes 879 Mbits/sec
[ 1] 4.0000-5.0000 sec 105 MBytes 878 Mbits/sec
[ 1] 5.0000-6.0000 sec 105 MBytes 878 Mbits/sec
[ 1] 6.0000-7.0000 sec 104 MBytes 875 Mbits/sec
[ 1] 7.0000-8.0000 sec 104 MBytes 875 Mbits/sec
[ 1] 8.0000-9.0000 sec 104 MBytes 873 Mbits/sec
[ 1] 9.0000-10.0000 sec 104 MBytes 875 Mbits/sec
[ 1] 0.0000-10.0073 sec 1.02 GBytes 875 Mbits/sec

2. Page pool with page_pool_release_page

shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
------------------------------------------------------------
Client connecting to 10.81.16.245, TCP port 5001
TCP window size: 416 KByte (WARNING: requested 1.91 MByte)
------------------------------------------------------------
[ 1] local 10.81.17.20 port 35924 connected with 10.81.16.245 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-1.0000 sec 101 MBytes 849 Mbits/sec
[ 1] 1.0000-2.0000 sec 102 MBytes 860 Mbits/sec
[ 1] 2.0000-3.0000 sec 102 MBytes 860 Mbits/sec
[ 1] 3.0000-4.0000 sec 102 MBytes 859 Mbits/sec
[ 1] 4.0000-5.0000 sec 103 MBytes 863 Mbits/sec
[ 1] 5.0000-6.0000 sec 103 MBytes 864 Mbits/sec
[ 1] 6.0000-7.0000 sec 103 MBytes 863 Mbits/sec
[ 1] 7.0000-8.0000 sec 103 MBytes 865 Mbits/sec
[ 1] 8.0000-9.0000 sec 103 MBytes 862 Mbits/sec
[ 1] 9.0000-10.0000 sec 102 MBytes 856 Mbits/sec
[ 1] 0.0000-10.0246 sec 1.00 GBytes 858 Mbits/sec


3. page pool with skb_mark_for_recycle

shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
------------------------------------------------------------
Client connecting to 10.81.16.245, TCP port 5001
TCP window size: 416 KByte (WARNING: requested 1.91 MByte)
------------------------------------------------------------
[ 1] local 10.81.17.20 port 42724 connected with 10.81.16.245 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-1.0000 sec 111 MBytes 931 Mbits/sec
[ 1] 1.0000-2.0000 sec 112 MBytes 935 Mbits/sec
[ 1] 2.0000-3.0000 sec 111 MBytes 934 Mbits/sec
[ 1] 3.0000-4.0000 sec 111 MBytes 934 Mbits/sec
[ 1] 4.0000-5.0000 sec 111 MBytes 934 Mbits/sec
[ 1] 5.0000-6.0000 sec 112 MBytes 935 Mbits/sec
[ 1] 6.0000-7.0000 sec 111 MBytes 934 Mbits/sec
[ 1] 7.0000-8.0000 sec 111 MBytes 933 Mbits/sec
[ 1] 8.0000-9.0000 sec 112 MBytes 935 Mbits/sec
[ 1] 9.0000-10.0000 sec 111 MBytes 933 Mbits/sec
[ 1] 0.0000-10.0069 sec 1.09 GBytes 934 Mbits/sec

This is a very significant performance improvement (page pool with
skb_mark_for_recycle). This is very close to the max goodput for a 1Gbit/s link.


For a small packet size (64 bytes), all three cases have almost the same result:


To me this indicates that the DMA map/unmap operations on this
platform are indeed more expensive for larger packets. That is exactly
what page_pool helps with, by keeping the DMA mapping intact when
recycling.

The driver still needs DMA-sync, but I notice you set the page_pool
feature flag PP_FLAG_DMA_SYNC_DEV. This is good, as page_pool will try
to reduce the sync size where possible. E.g. in this SKB case it will
reduce the DMA-sync to max_len=FEC_ENET_RX_FRSIZE, which should also
help performance.

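For reference, a minimal page_pool setup sketch that goes with those
flags (ring size, headroom define and error handling are assumptions
for illustration, not the actual patch):

/* Hypothetical page_pool creation sketch; sizes/defines are assumed. */
struct page_pool_params pp_params = {
        .order          = 0,
        .flags          = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
        .pool_size      = RX_RING_SIZE,          /* assumed ring size */
        .nid            = dev_to_node(&fep->pdev->dev),
        .dev            = &fep->pdev->dev,
        .dma_dir        = DMA_FROM_DEVICE,
        .offset         = FEC_ENET_XDP_HEADROOM, /* assumed headroom define */
        .max_len        = FEC_ENET_RX_FRSIZE,    /* caps the dma_sync length */
};

rxq->page_pool = page_pool_create(&pp_params);
if (IS_ERR(rxq->page_pool)) {
        err = PTR_ERR(rxq->page_pool);
        rxq->page_pool = NULL;
        return err;
}
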

shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1 -l 64
------------------------------------------------------------
Client connecting to 10.81.16.245, TCP port 5001
TCP window size: 416 KByte (WARNING: requested 1.91 MByte)
------------------------------------------------------------
[ 1] local 10.81.17.20 port 58204 connected with 10.81.16.245 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-1.0000 sec 36.9 MBytes 309 Mbits/sec
[ 1] 1.0000-2.0000 sec 36.6 MBytes 307 Mbits/sec
[ 1] 2.0000-3.0000 sec 36.6 MBytes 307 Mbits/sec
[ 1] 3.0000-4.0000 sec 36.5 MBytes 307 Mbits/sec
[ 1] 4.0000-5.0000 sec 37.1 MBytes 311 Mbits/sec
[ 1] 5.0000-6.0000 sec 37.2 MBytes 312 Mbits/sec
[ 1] 6.0000-7.0000 sec 37.1 MBytes 311 Mbits/sec
[ 1] 7.0000-8.0000 sec 37.1 MBytes 311 Mbits/sec
[ 1] 8.0000-9.0000 sec 37.1 MBytes 312 Mbits/sec
[ 1] 9.0000-10.0000 sec 37.2 MBytes 312 Mbits/sec
[ 1] 0.0000-10.0097 sec 369 MBytes 310 Mbits/sec

Regards,
Shenwei


By small packets, do you mean those under the copybreak limit?

Please provide some benchmark numbers with your next patchset.

Yes, the packet size is 64 bytes and it is under the copybreak limit.
As the impact is not significant, I would prefer to remove the
copybreak logic.

+1 to removing this logic if possible, due to maintenance cost.

--Jesper