Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

From: Yunsheng Lin
Date: Sat Jun 24 2023 - 10:44:43 EST


On 2023/6/21 19:55, Jesper Dangaard Brouer wrote:
>
>
> On 20/06/2023 23.16, Lorenzo Bianconi wrote:
>> [...]
>>
>>>> I did some experiments using page_frag_cache/page_frag_alloc() instead of
>>>> page_pools in a simple environment I used to test XDP for veth driver.
>>>> In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
>>>> the page_frag_cache in order to copy the full skb in the new one, actually
>>>> "linearizing" the packet (since we know the original skb length).
>>>> I run an iperf TCP connection over a veth pair where the
>>>> remote device runs the xdp_rxq_info sample (available in the kernel source
>>>> tree, with action XDP_PASS):
>>>>
>>>> TCP client -- v0 === v1 (xdp_rxq_info) -- TCP server
>>>>
>>>> net-next (page_pool):
>>>> - MTU 1500B: ~  7.5 Gbps
>>>> - MTU 8000B: ~ 15.3 Gbps
>>>>
>>>> net-next + page_frag_alloc:
>>>> - MTU 1500B: ~  8.4 Gbps
>>>> - MTU 8000B: ~ 14.7 Gbps
>>>>
>>>> It seems there is no clear "win" situation here (at least in this environment
>>>> and with this simple approach). Moreover:
>>>
>>> For the 1500B packets it is a win, but for 8000B it looks like there
>>> is a regression. Any idea what is causing it?
>>
>> nope, I have not looked into it yet.
>>
>
> I think I can explain via using micro-benchmark numbers.
> (Lorenzo and I have discussed this over IRC, so this is our summary)
>
> *** MTU 1500***
>
> * The MTU 1500 case, where page_frag_alloc is faster than PP (page_pool):
>
> The PP allocates a 4K page for MTU 1500. The cost of alloc + recycle via
> the ptr_ring is 48 cycles (page_pool02_ptr_ring Per elem: 48 cycles(tsc)).
>
> The page_frag_alloc API allocates a 32KB order-3 page, and chops it up
> for packets.  The order-3 alloc + free costs 514 cycles (page_bench01:
> alloc_pages order:3(32768B) 514 cycles). The MTU 1500 needs alloc size
> 1514+320+256 = 2090 bytes.  In 32KB we can fit 15 packets.  Thus, the
> amortized cost per packet is only 34.3 cycles (514/15).
>
> This explains why the page_frag_alloc API has an advantage here, as
> its amortized cost per packet is lower.
>
>
> *** MTU 8000 ***
>
> * The MTU 8000 case, where PP is faster than page_frag_alloc.
>
> The page_frag_alloc API cannot slice the same 32KB into as many packets.
> The MTU 8000 needs alloc size 8000+14+256+320 = 8590 bytes.  This can
> only fit 3 full packets (32768/8590 = 3.81).
> Thus, the cost is 514/3 = 171 cycles.
>
> The PP is actually challenged at MTU 8000, because it unfortunately
> leads to allocating 3 full pages (12KiB), due to the needed alloc size of
> 8590 bytes. Thus the cost is 3x 48 cycles = 144 cycles.
> (There is also a chance that Jakub's "allow_direct" optimization in
> page_pool_return_skb_page increases performance for PP.)
>
> This explains why PP is fastest in this case.

Great analysis.
So the problem seems to be: can we optimize the page fragment cache
implementation so that it at least matches the performance of PP for
the above case? Alexander seems to be against using PP for the veth
case when DMA mapping is not involved.
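
For reference, the allocation pattern being compared on the page_frag
side is roughly the sketch below (modelled on Lorenzo's description,
not the actual veth code; veth_frag_cache and veth_alloc_linear_buf()
are made-up names for illustration):

#include <linux/bpf.h>          /* XDP_PACKET_HEADROOM */
#include <linux/gfp.h>          /* page_frag_alloc() */
#include <linux/percpu.h>
#include <linux/skbuff.h>

static DEFINE_PER_CPU(struct page_frag_cache, veth_frag_cache);

/* Carve a linear buffer for the copied skb out of the per-CPU
 * page_frag_cache (assumes NAPI/softirq context).
 */
static void *veth_alloc_linear_buf(unsigned int len)
{
	struct page_frag_cache *nc = this_cpu_ptr(&veth_frag_cache);
	unsigned int truesize;

	/* headroom + data + skb_shared_info: roughly the 1514+320+256 =
	 * 2090 bytes quoted above (plus alignment)
	 */
	truesize = SKB_DATA_ALIGN(XDP_PACKET_HEADROOM + len) +
		   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

	/* One order-3 (32KB) page backs many such fragments, so the
	 * ~514-cycle page allocation is amortized across all of them.
	 */
	return page_frag_alloc(nc, truesize, GFP_ATOMIC);
}

The open question is whether this path can also amortize (or avoid) the
per-page cost for the MTU 8000 case, where only ~3 fragments fit per page.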

>
>
> *** Surprising insights ***
>
> My (maybe) surprising conclusion is that we should combine the two
> approaches, which is basically what Lin's patchset is doing!
> Thus, I've actually suddenly become a fan of this patchset...
>
> The insight is that PP can also work with higher-order pages, and the
> cost of PP recycling via the ptr_ring will be the same regardless of
> page order.  Thus, we can reduce the order-3 cost from 514 cycles to
> basically 48 cycles, and with 15 packets (MTU 1500) per page the
> amortized allocator cost is 48/15 = 3.2 cycles.
>
> On the PP alloc side this will be amazingly fast. On the frag recycle
> side (see page_pool_defrag_page()) there is an atomic_sub operation.
> I've measured atomic_inc to cost 17 cycles (for the optimal non-contended
> case); thus 3+17 = 20 cycles, which should still be a win.
>
>
> --Jesper
>
>
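
To make the "combine the two approaches" point concrete, below is a
minimal sketch using the frag API that already exists in net-next
(page_pool_alloc_frag()); it is not the new page_pool_alloc() from this
patchset, and the function names and pool parameters are made up for
illustration:

#include <linux/numa.h>
#include <net/page_pool.h>

/* An order-3 page_pool with frag tracking: one ptr_ring recycle
 * (~48 cycles) is shared by all fragments carved from a 32KB page.
 */
static struct page_pool *veth_create_frag_pool(void)
{
	struct page_pool_params pp_params = {
		.flags		= PP_FLAG_PAGE_FRAG,	/* enable frag refcounting */
		.order		= 3,			/* 32KB pages, as in the numbers above */
		.pool_size	= 256,			/* illustrative */
		.nid		= NUMA_NO_NODE,
	};

	return page_pool_create(&pp_params);
}

static struct page *veth_alloc_mtu1500_frag(struct page_pool *pool,
					    unsigned int *offset)
{
	/* ~2090 bytes per packet -> ~15 frags per 32KB page; the page only
	 * travels through the ptr_ring again once page_pool_defrag_page()
	 * drops the last frag reference (the atomic_sub mentioned above).
	 */
	return page_pool_alloc_frag(pool, offset, 2090, GFP_ATOMIC);
}

With something like this, the per-packet allocator cost at MTU 1500
should be roughly the 48/15 = 3.2 cycles plus the ~17-cycle atomic
estimated above, instead of 514/15 = 34.3 cycles for the bare order-3
page allocation.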