Re: [PATCH net-next] page pool: not return page to alloc cache during pool destruction

From: Jesper Dangaard Brouer
Date: Thu Jun 15 2023 - 10:01:18 EST

On 15/06/2023 06.20, Jakub Kicinski wrote:
> On Thu, 15 Jun 2023 09:36:45 +0800 Liang Chen wrote:
>> When destroying a page pool, the alloc cache and recycle ring are emptied.
>> If there are inflight pages, the retry process will periodically check the
>> recycle ring for recently returned pages, but not the alloc cache (alloc
>> cache is only emptied once). As a result, any pages returned to the alloc
>> cache after the page pool destruction will be stuck there and cause the
>> retry process to continuously look for inflight pages and report warnings.
>>
>> To safeguard against this situation, any pages returning to the alloc cache
>> after pool destruction should be prevented.
>
> Let's hear from the page pool maintainers but I think the driver
> is supposed to prevent allocations while pool is getting destroyed.

Yes, this is a driver API violation. Direct returns (allow_direct) can
only happen from the driver's RX path, e.g. while the driver is actively
processing packets (in NAPI). When a driver is shutting down a page_pool,
it MUST have stopped the RX path and NAPI (napi_disable()) before calling
page_pool_destroy(). Thus, this situation cannot happen, and if it does,
it is a driver bug.
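
To make the expected order concrete, here is a minimal sketch (my
illustration, not code from any real driver; napi_disable() and
page_pool_destroy() are the real APIs, all the mydrv_* names are made up):

	static void mydrv_rx_queue_teardown(struct mydrv_rxq *rxq)
	{
		/* Stop RX processing first: after napi_disable() no more
		 * direct returns (allow_direct) can happen for this pool. */
		napi_disable(&rxq->napi);

		/* Release any pages still sitting in the RX descriptor ring. */
		mydrv_free_rx_ring(rxq);

		/* Only now is it safe to tear down the pool. */
		page_pool_destroy(rxq->page_pool);
	}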

> Perhaps we can add DEBUG_NET_WARN_ON_ONCE() for this condition to
> prevent wasting cycles in production builds?
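
Such a check in the direct-return path would look roughly like the below
(a sketch only; DEBUG_NET_WARN_ON_ONCE() is the real macro, but the
pool->destroyed flag is hypothetical -- it just marks where such state
would have to live):

	if (allow_direct) {
		/* A direct return after page_pool_destroy() is the
		 * driver bug described above; warn only on
		 * CONFIG_DEBUG_NET builds. */
		DEBUG_NET_WARN_ON_ONCE(pool->destroyed);
		/* ... existing direct recycle into pool->alloc.cache ... */
	}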


For this page_pool code path ("allow_direct") it is extremely important
that we avoid wasting cycles in production, as this path is used for
XDP_DROP use-cases on 100Gbit/s NICs.

At 100Gbit/s with 64-byte Ethernet frames (84 bytes on the wire), the
wirespeed is 148.8Mpps, which gives the CPU 6.72 nanosec to process each
packet. The microbench[1] shows (results below my signature) that
page_pool_alloc_pages() + page_pool_recycle_direct() together cost
4.041 ns (or 14 cycles(tsc)).
Thus, for this fast-path code every cycle counts.

In practice, PCIe transactions/sec seem to limit the total system to
108Mpps (with multiple RX queues + descriptor compression), leaving
9.26 nanosec to process each packet. Individual hardware RX queues seem
to be limited to around 36Mpps, leaving 27.77 nanosec per packet.
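
For reference, the arithmetic behind these per-packet budgets (a
standalone userspace C snippet, nothing kernel-specific; the 84 bytes
are the 64-byte frame plus 20 bytes of preamble/SFD/inter-frame gap on
the wire):

	#include <stdio.h>

	int main(void)
	{
		double wire_bits = 84.0 * 8;                 /* bits per 64B frame on the wire */
		double mpps_100g = 100e9 / wire_bits / 1e6;  /* ~148.8 Mpps */

		/* Per-packet budget in ns = 1000 / rate-in-Mpps */
		printf("100G wirespeed : %.1f Mpps -> %.2f ns/pkt\n",
		       mpps_100g, 1e3 / mpps_100g);
		printf("PCIe limited   : 108 Mpps -> %.2f ns/pkt\n", 1e3 / 108.0);
		printf("Single RX queue: 36 Mpps  -> %.2f ns/pkt\n", 1e3 / 36.0);
		return 0;
	}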

Adding a DEBUG_NET_WARN_ON_ONCE() will be annoying, as I like to run my
testlab kernels with CONFIG_DEBUG_NET, which will change this extreme
fast-path slightly (adding some unlikely()s to the mix, affecting code
layout).
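
For reference, this is approximately how the macro is defined in
include/net/net_debug.h; the condition only generates code (and the
unlikely branch inside WARN_ON_ONCE) when CONFIG_DEBUG_NET is set,
otherwise it compiles away while still type-checking the expression:

	#ifdef CONFIG_DEBUG_NET
	#define DEBUG_NET_WARN_ON_ONCE(cond) ((void)WARN_ON_ONCE(cond))
	#else
	#define DEBUG_NET_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond)
	#endif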

Question to Liang Chen: Did you hit this bug in practice?

--Jesper

CPU E5-1650 v4 @ 3.60GHz
tasklet_page_pool01_fast_path Per elem:  14 cycles(tsc)  4.041 ns
tasklet_page_pool02_ptr_ring  Per elem:  49 cycles(tsc) 13.622 ns
tasklet_page_pool03_slow      Per elem: 162 cycles(tsc) 45.198 ns

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c