RE: [EXT] Re: [PATCH 1/1] net: fec: add initial XDP support

From: Shenwei Wang
Date: Mon Oct 03 2022 - 08:50:03 EST


Hi Jesper,

> >> On the mvneta driver/platform we saw a huge speedup replacing:
> >>
> >>     page_pool_release_page(rxq->page_pool, page);
> >> with:
> >>     skb_mark_for_recycle(skb);
> >>

After replacing page_pool_release_page() with skb_mark_for_recycle(), I found something in the test results that confuses me a little.
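
For reference, the change I tested is roughly the sketch below (the surrounding
SKB construction in the fec_main.c RX path is simplified, and data_offset and
pkt_len are placeholders for the driver's actual values, not the exact patch):

    /* Before: unmap the page and drop it out of the page_pool. */
    skb = build_skb(page_address(page), PAGE_SIZE);
    skb_reserve(skb, data_offset);
    skb_put(skb, pkt_len);
    page_pool_release_page(rxq->page_pool, page);

    /* After: keep the DMA mapping and let the SKB path recycle the page. */
    skb = build_skb(page_address(page), PAGE_SIZE);
    skb_reserve(skb, data_offset);
    skb_put(skb, pkt_len);
    skb_mark_for_recycle(skb);
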
I tested with the "xdpsock" sample application in two modes: 1. Native (xdpsock -i eth0); 2. SKB mode (xdpsock -S -i eth0).
The test results are as follows:
            With page_pool_release_page (pps)   With skb_mark_for_recycle (pps)

SKB mode    90K                                 200K
Native      190K                                190K

The skb_mark_for_recycle change boosted SKB-mode performance to 200K+ pps, which is even higher than
the native mode. Is this result reasonable? Do you have any clue why SKB-mode performance can
exceed that of native mode?

Thanks,
Shenwei



> -----Original Message-----
> From: Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx>
> Sent: Thursday, September 29, 2022 1:55 PM
> To: Shenwei Wang <shenwei.wang@xxxxxxx>; Jesper Dangaard Brouer
> <jbrouer@xxxxxxxxxx>; Andrew Lunn <andrew@xxxxxxx>
> Cc: brouer@xxxxxxxxxx; Joakim Zhang <qiangqing.zhang@xxxxxxx>; David S.
> Miller <davem@xxxxxxxxxxxxx>; Eric Dumazet <edumazet@xxxxxxxxxx>; Jakub
> Kicinski <kuba@xxxxxxxxxx>; Paolo Abeni <pabeni@xxxxxxxxxx>; Alexei
> Starovoitov <ast@xxxxxxxxxx>; Daniel Borkmann <daniel@xxxxxxxxxxxxx>;
> Jesper Dangaard Brouer <hawk@xxxxxxxxxx>; John Fastabend
> <john.fastabend@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; imx@xxxxxxxxxxxxxxx
> Subject: Re: [EXT] Re: [PATCH 1/1] net: fec: add initial XDP support
>
> On 29/09/2022 17.52, Shenwei Wang wrote:
> >
> >> From: Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx>
> >>
> >> On 29/09/2022 15.26, Shenwei Wang wrote:
> >>>
> >>>> From: Andrew Lunn <andrew@xxxxxxx>
> >>>> Sent: Thursday, September 29, 2022 8:23 AM
> >> [...]
> >>>>
> >>>>> I actually did some comparison testing of the page pool with
> >>>>> normal traffic. So far I don't see a significant improvement with the
> >>>>> current implementation. The performance for large packets improves
> >>>>> a little, and the performance for small packets gets a little worse.
> >>>>
> >>>> What hardware was this for? imx51? imx6? imx7? Vybrid? These all
> >>>> use the FEC.
> >>>
> >>> I tested on imx8qxp platform. It is ARM64.
> >>
> >> On the mvneta driver/platform we saw a huge speedup replacing:
> >>
> >>     page_pool_release_page(rxq->page_pool, page);
> >> with:
> >>     skb_mark_for_recycle(skb);
> >>
> >> As I mentioned: today page_pool has SKB recycle support (you might
> >> have looked at drivers that didn't utilize this yet), thus you don't
> >> need to release the page (page_pool_release_page) here. Instead you
> >> could simply mark the SKB for recycling, unless the driver does some
> >> page refcnt tricks I didn't notice.
> >>
> >> On the mvneta driver/platform the DMA unmap (in
> >> page_pool_release_page) was very expensive. This imx8qxp platform
> >> might have a faster DMA unmap in case it is cache-coherent.
> >>
> >> I would be very interested in knowing if skb_mark_for_recycle() helps
> >> on this platform, for normal network stack performance.
> >>
> >
> > I did a quick comparison test for the following 3 scenarios:
>
> Thanks for doing this! :-)
>
> > 1. original implementation
> >
> > shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
> > ------------------------------------------------------------
> > Client connecting to 10.81.16.245, TCP port 5001 TCP window size: 416
> > KByte (WARNING: requested 1.91 MByte)
> > ------------------------------------------------------------
> > [ 1] local 10.81.17.20 port 49154 connected with 10.81.16.245 port 5001
> > [ ID] Interval Transfer Bandwidth
> > [ 1] 0.0000-1.0000 sec 104 MBytes 868 Mbits/sec
> > [ 1] 1.0000-2.0000 sec 105 MBytes 878 Mbits/sec
> > [ 1] 2.0000-3.0000 sec 105 MBytes 881 Mbits/sec
> > [ 1] 3.0000-4.0000 sec 105 MBytes 879 Mbits/sec
> > [ 1] 4.0000-5.0000 sec 105 MBytes 878 Mbits/sec
> > [ 1] 5.0000-6.0000 sec 105 MBytes 878 Mbits/sec
> > [ 1] 6.0000-7.0000 sec 104 MBytes 875 Mbits/sec
> > [ 1] 7.0000-8.0000 sec 104 MBytes 875 Mbits/sec
> > [ 1] 8.0000-9.0000 sec 104 MBytes 873 Mbits/sec
> > [ 1] 9.0000-10.0000 sec 104 MBytes 875 Mbits/sec
> > [ 1] 0.0000-10.0073 sec 1.02 GBytes 875 Mbits/sec
> >
> > 2. Page pool with page_pool_release_page
> >
> > shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
> > ------------------------------------------------------------
> > Client connecting to 10.81.16.245, TCP port 5001 TCP window size: 416
> > KByte (WARNING: requested 1.91 MByte)
> > ------------------------------------------------------------
> > [ 1] local 10.81.17.20 port 35924 connected with 10.81.16.245 port 5001
> > [ ID] Interval Transfer Bandwidth
> > [ 1] 0.0000-1.0000 sec 101 MBytes 849 Mbits/sec
> > [ 1] 1.0000-2.0000 sec 102 MBytes 860 Mbits/sec
> > [ 1] 2.0000-3.0000 sec 102 MBytes 860 Mbits/sec
> > [ 1] 3.0000-4.0000 sec 102 MBytes 859 Mbits/sec
> > [ 1] 4.0000-5.0000 sec 103 MBytes 863 Mbits/sec
> > [ 1] 5.0000-6.0000 sec 103 MBytes 864 Mbits/sec
> > [ 1] 6.0000-7.0000 sec 103 MBytes 863 Mbits/sec
> > [ 1] 7.0000-8.0000 sec 103 MBytes 865 Mbits/sec
> > [ 1] 8.0000-9.0000 sec 103 MBytes 862 Mbits/sec
> > [ 1] 9.0000-10.0000 sec 102 MBytes 856 Mbits/sec
> > [ 1] 0.0000-10.0246 sec 1.00 GBytes 858 Mbits/sec
> >
> >
> > 3. page pool with skb_mark_for_recycle
> >
> > shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
> > ------------------------------------------------------------
> > Client connecting to 10.81.16.245, TCP port 5001 TCP window size: 416
> > KByte (WARNING: requested 1.91 MByte)
> > ------------------------------------------------------------
> > [ 1] local 10.81.17.20 port 42724 connected with 10.81.16.245 port 5001
> > [ ID] Interval Transfer Bandwidth
> > [ 1] 0.0000-1.0000 sec 111 MBytes 931 Mbits/sec
> > [ 1] 1.0000-2.0000 sec 112 MBytes 935 Mbits/sec
> > [ 1] 2.0000-3.0000 sec 111 MBytes 934 Mbits/sec
> > [ 1] 3.0000-4.0000 sec 111 MBytes 934 Mbits/sec
> > [ 1] 4.0000-5.0000 sec 111 MBytes 934 Mbits/sec
> > [ 1] 5.0000-6.0000 sec 112 MBytes 935 Mbits/sec
> > [ 1] 6.0000-7.0000 sec 111 MBytes 934 Mbits/sec
> > [ 1] 7.0000-8.0000 sec 111 MBytes 933 Mbits/sec
> > [ 1] 8.0000-9.0000 sec 112 MBytes 935 Mbits/sec
> > [ 1] 9.0000-10.0000 sec 111 MBytes 933 Mbits/sec
> > [ 1] 0.0000-10.0069 sec 1.09 GBytes 934 Mbits/sec
>
> This is a very significant performance improvement (page pool with
> skb_mark_for_recycle). This is very close to the max goodput for a 1Gbit/s link.
>
>
> > For small packet size (64 bytes), all three cases have almost the same result:
> >
>
> To me this indicates that the DMA map/unmap operations on this platform are
> indeed more expensive for larger packets, and this is exactly the cost that
> page_pool avoids by keeping the DMA mapping intact when recycling.
>
> The driver still needs DMA-sync, but I notice you set the page_pool feature
> flag PP_FLAG_DMA_SYNC_DEV. This is good, as page_pool will try to reduce the
> sync size where possible; e.g. in this SKB case it will reduce the DMA-sync
> to max_len=FEC_ENET_RX_FRSIZE, which should also help performance.
>
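
[For reference, the page_pool setup in the version I tested looks roughly like
the sketch below. The PP_FLAG_DMA_SYNC_DEV flag and max_len = FEC_ENET_RX_FRSIZE
are the parts relevant here; the other field values are illustrative.]

    struct page_pool_params pp_params = {
        .order     = 0,
        .flags     = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
        .pool_size = size,                  /* RX ring size (placeholder) */
        .nid       = dev_to_node(&fep->pdev->dev),
        .dev       = &fep->pdev->dev,
        .dma_dir   = DMA_FROM_DEVICE,
        .offset    = XDP_PACKET_HEADROOM,
        .max_len   = FEC_ENET_RX_FRSIZE,    /* bounds the per-packet DMA sync */
    };

    rxq->page_pool = page_pool_create(&pp_params);
    if (IS_ERR(rxq->page_pool)) {
        err = PTR_ERR(rxq->page_pool);      /* err declared in the real function */
        rxq->page_pool = NULL;
    }
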
>
> > shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1 -l 64
> > ------------------------------------------------------------
> > Client connecting to 10.81.16.245, TCP port 5001 TCP window size: 416
> > KByte (WARNING: requested 1.91 MByte)
> > ------------------------------------------------------------
> > [ 1] local 10.81.17.20 port 58204 connected with 10.81.16.245 port 5001
> > [ ID] Interval Transfer Bandwidth
> > [ 1] 0.0000-1.0000 sec 36.9 MBytes 309 Mbits/sec
> > [ 1] 1.0000-2.0000 sec 36.6 MBytes 307 Mbits/sec
> > [ 1] 2.0000-3.0000 sec 36.6 MBytes 307 Mbits/sec
> > [ 1] 3.0000-4.0000 sec 36.5 MBytes 307 Mbits/sec
> > [ 1] 4.0000-5.0000 sec 37.1 MBytes 311 Mbits/sec
> > [ 1] 5.0000-6.0000 sec 37.2 MBytes 312 Mbits/sec
> > [ 1] 6.0000-7.0000 sec 37.1 MBytes 311 Mbits/sec
> > [ 1] 7.0000-8.0000 sec 37.1 MBytes 311 Mbits/sec
> > [ 1] 8.0000-9.0000 sec 37.1 MBytes 312 Mbits/sec
> > [ 1] 9.0000-10.0000 sec 37.2 MBytes 312 Mbits/sec
> > [ 1] 0.0000-10.0097 sec 369 MBytes 310 Mbits/sec
> >
> > Regards,
> > Shenwei
> >
> >
> >>>> By small packets, do you mean those under the copybreak limit?
> >>>>
> >>>> Please provide some benchmark numbers with your next patchset.
> >>>
> >>> Yes, the packet size is 64 bytes and it is under the copybreak limit.
> >>> As the impact is not significant, I would prefer to remove the
> >>> copybreak logic.
> >>
> >> +1 to removing this logic if possible, due to maintenance cost.
> >>
> >> --Jesper
> >