Re: [PATCH net-next v2 1/3] page_pool: unify frag page and non-frag page handling

From: Yunsheng Lin
Date: Wed Jun 07 2023 - 08:46:42 EST


On 2023/6/6 23:33, Alexander Duyck wrote:
>> Do you mean doing something like below? Isn't it dirtying the cache line
>> of 'struct page' whenever a page is recycled, which means we may not be
>> able to maintain the current performance for the non-fragmented or
>> mono-frag case?
>>
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -583,6 +583,10 @@ static __always_inline struct page *
>>  __page_pool_put_page(struct page_pool *pool, struct page *page,
>>  		     unsigned int dma_sync_size, bool allow_direct)
>>  {
>> +
>> +	if (!page_pool_defrag_page(page, 1))
>> +		return NULL;
>> +
>
> Yes, that is pretty much it. This would be your standard case page
> pool put path. Basically it allows us to start getting rid of a bunch
> of noise in the fragmented path.
>
>>  	/* This allocator is optimized for the XDP mode that uses
>>  	 * one-frame-per-page, but have fallbacks that act like the
>>  	 * regular page allocator APIs.
>> @@ -594,6 +598,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>>  	 */
>>  	if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
>>  		/* Read barrier done in page_ref_count / READ_ONCE */
>> +		page_pool_fragment_page(page, 1);
>
> I wouldn't bother resetting this to 1 until after you have recycled it
> and pulled it back out again as an allocation. Basically when the
> pages are sitting in the pool the frag_count should be 0. That way it
> makes it easier to track and is similar to how the memory allocator
> actually deals with the page reference count. Basically if the page is
> sitting in the pool the frag_count is 0, once it comes out it should
> be 1 or more indicating it is in use.

Let's be more specific about what we want to do here:

For a specific page without splitting, the journey it goes through
before this patch is as below:
1. It is allocated from the page allocator.
2. It is initialized in page_pool_set_pp_info().
3. It is passed to the driver by the page pool.
4. It is passed to the stack by the driver.
5. It is passed back to the page pool by the stack, depending on the
   page_ref_count() checking:
   5.1 page_ref_count() being one, the page is now owned by the page
       pool, and may be passed to the driver again by going to step 3.
   5.2 page_ref_count() not being one, the page is released by the
       page pool, which does resource cleanup such as DMA unmapping
       and put_page().

So, as I understand it, a page may go through steps 3 ~ 5.1 many times
without dirtying the cache line of 'struct page'.
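To make the write-free recycle concrete, this is roughly the decision
in __page_pool_put_page() today (the DMA sync and the in-softirq
direct-recycle details are elided):

	if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
		/* the pool is the only owner again: recycle the page
		 * without writing to 'struct page' at all
		 */
		return page;
	}

	/* otherwise: unmap the DMA address, put_page(), drop from pool */
	return NULL;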


If I follow your suggestion here, it seems that a specific page without
splitting may go through:
1. It is allocated from the page allocator.
2. It is initialized in page_pool_set_pp_info().
3. Its pp_frag_count is set to one.
4. It is passed to the driver by the page pool.
5. It is passed to the stack by the driver.
6. It is passed back to the page pool by the stack, depending on the
   pp_frag_count and page_ref_count() checking:
   6.1 pp_frag_count and page_ref_count() both being one, the page is
       now owned by the page pool, and may be passed to the driver
       again by going to step 3.
   6.2 otherwise the page is released by the page pool, which does
       resource cleanup such as DMA unmapping and put_page().
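A rough sketch of that flow, reusing the existing helper names (this is
only an illustration, not a tested patch, and pool_put_suggested() is a
made-up name):

static bool pool_put_suggested(struct page_pool *pool, struct page *page)
{
	/* step 6: pp_frag_count 1 -> 0 */
	if (page_pool_defrag_page(page, 1))
		return false;		/* other frag users still hold it */

	if (page_ref_count(page) != 1 || page_is_pfmemalloc(page))
		return false;		/* step 6.2: release the page */

	/* step 6.1: recycle; the page now sits in the pool with
	 * pp_frag_count == 0, so step 3 has to do
	 * page_pool_fragment_page(page, 1) -- an atomic_long_set()
	 * on 'struct page' -- on every round trip.
	 */
	return true;
}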

Aren't we dirtying the cache line of 'struct page' every time the page
is recycled? Or did I miss something obvious here?

For my implementation, a specific page without splitting may go
through:
1. It is allocated from the page allocator.
2. It is initialized in page_pool_set_pp_info(), and its pp_frag_count
   is set to one.
3. It is passed to the driver by the page pool.
4. It is passed to the stack by the driver.
5. It is passed back to the page pool by the stack, depending on the
   page_ref_count() checking:
   5.1 pp_frag_count and page_ref_count() both being one, the page is
       now owned by the page pool; and because page_pool_defrag_page()
       is optimized not to update page->pp_frag_count for the last
       frag user, the page can be passed to the driver again by going
       to step 3 without resetting pp_frag_count to one, which would
       dirty the cache line of 'struct page'.
   5.2 otherwise the page is released by the page pool, which does
       resource cleanup such as DMA unmapping and put_page().
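Condensed (again only a sketch, with recycle()/release() standing in
for the real recycle and cleanup paths), the point is that the put path
stays write-free in the common case:

	/* pp_frag_count was set to 1 once, in page_pool_set_pp_info() */
	if (page_pool_defrag_page(page, 1) == 0 &&	/* read-only: the
							 * count already
							 * equals nr == 1 */
	    page_ref_count(page) == 1 && !page_is_pfmemalloc(page))
		recycle(page);	/* pp_frag_count is still 1, no reset */
	else
		release(page);	/* DMA unmap + put_page() */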

Does that make sense, or does it not really matter that we dirty the
cache line of 'struct page' whenever a non-split page is recycled?


>>
>> Does 'defragging a fragmented page' mean decrementing pp_frag_count?
>> Does "freeing it" mean calling put_page()? What does 'combined' really
>> mean here?
>
> The change is that the code would do the subtraction and if it hit 0
> it was freeing the page. That is the one piece that gets more
> complicated because we really should be hitting 1. So we may be adding
> a few more operations to that case.
>
>>>

I am not sure I understand. Does 'gets more complicated' mean doing
some optimization like page_pool_defrag_page() does?
https://elixir.bootlin.com/linux/v6.4-rc5/source/include/net/page_pool.h#L314
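For readers following along, the optimization at that link is
essentially the below (v6.4-rc5, comments paraphrased): the last frag
user sees pp_frag_count == nr and returns without any atomic update,
leaving the stale value in place for the next user to overwrite.

static inline long page_pool_defrag_page(struct page *page, long nr)
{
	long ret;

	/* If nr == pp_frag_count, all remaining references are cleared;
	 * an atomic_long_read() is much cheaper than an atomic update,
	 * and the stale value can be overwritten by the calling function.
	 */
	if (atomic_long_read(&page->pp_frag_count) == nr)
		return 0;

	ret = atomic_long_sub_return(nr, &page->pp_frag_count);
	WARN_ON(ret < 0);
	return ret;
}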

>
>>>
>>> With that you could then let drivers like the Mellanox one handle its
>>> own fragmenting knowing it has to return things to a mono-frag in
>>> order for it to be working correctly.
>>
>> I still don't really see how it would be better for mlx5 to handle
>> its own fragmenting yet.
>>
>> +cc Dragos & Saeed to share some info here, so that we can see
>> if the page pool can learn from it.
>
> It has more to do with the fact that the driver knows what it is going
> to do beforehand. In many cases it can look at the page and know that
> it isn't going to reuse it again so it can just report the truesize
> being the length from the current pointer to the end of the page.
>
> You can think of it as the performance advantage of a purpose built
> ASIC versus a general purpose CPU. The fact is we are able to cut out
> much of the unnecessary overhead if we know exactly how we are going
> to use the memory in the driver versus having to guess at it in the
> page pool API.

In general, I would agree with that.
But I am not sure about the specific case of mlx5; that's why I am
curious about the exact reason why it does the complicated frag_count
handling in the driver instead of improving the page pool to support
its use case. If it is about the truesize of the last frag, we can do
something like virtio_net does in the page pool too.
