Re: [PATCH net-next v2 1/3] page_pool: unify frag page and non-frag page handling

From: Alexander Duyck
Date: Wed Jun 07 2023 - 11:06:28 EST


On Wed, Jun 7, 2023 at 5:46 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
>
> On 2023/6/6 23:33, Alexander Duyck wrote:
> >> Do you mean doing something like below? isn't it dirtying the cache line
> >> of 'struct page' whenever a page is recycled, which means we may not be
> >> able to the maintain current performance for non-fragmented or mono-frag
> >> case?
> >>
> >> --- a/net/core/page_pool.c
> >> +++ b/net/core/page_pool.c
> >> @@ -583,6 +583,10 @@ static __always_inline struct page *
> >> __page_pool_put_page(struct page_pool *pool, struct page *page,
> >> unsigned int dma_sync_size, bool allow_direct)
> >> {
> >> +
> >> + if (!page_pool_defrag_page(page, 1))
> >> + return NULL;
> >> +
> >
> > Yes, that is pretty much it. This would be your standard case page
> > pool put path. Basically it allows us to start getting rid of a bunch
> > of noise in the fragmented path.
> >
> >> /* This allocator is optimized for the XDP mode that uses
> >> * one-frame-per-page, but have fallbacks that act like the
> >> * regular page allocator APIs.
> >> @@ -594,6 +598,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >> */
> >> if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> >> /* Read barrier done in page_ref_count / READ_ONCE */
> >> + page_pool_fragment_page(page, 1);
> >
> > I wouldn't bother resetting this to 1 until after you have recycled it
> > and pulled it back out again as an allocation. Basically when the
> > pages are sitting in the pool the frag_count should be 0. That way it
> > makes it easier to track and is similar to how the memory allocator
> > actually deals with the page reference count. Basically if the page is
> > sitting in the pool the frag_count is 0, once it comes out it should
> > be 1 or more indicating it is in use.
>
> Let's be more specific about what we want to do here:
>
> For a specific page without splitting, the journey that it will go
> through is as below before this patch:
> 1. It is allocated from the page allocator.
> 2. It is initialized in page_pool_set_pp_info().
> 3. It is passed to driver by page pool.
> 4. It is passed to stack by the driver.
> 5. It is passed back to the page pool by the stack, depending on the
> page_ref_count() checking:
> 5.1 page_ref_count() being one, the page is now owned by the page
> pool, and may be passed to the driver by going to step 3.
> 5.2 page_ref_count() not being one, the page is released by page
> pool doing resoure cleaning like dma mapping and put_page().
>
> So a page may go through step 3 ~ 5.1 many times without dirtying
> the cache line of 'struct page' as my understanding.
>
>
> If I follow your suggestion here, It seems for a specific page without
> splitting, it may go through:
> 1. It is allocated from the page allocator.
> 2. It is initialized in page_pool_set_pp_info().
> 3. It's pp_frag_count is set to one.
> 4. It is passed to driver by page pool.
> 5. It is passed to stack by the driver.
> 6. It is passed back to the page pool by the stack, depending on the
> pp_frag_count and page_ref_count() checking:
> 6.1 pp_frag_count and page_ref_count() being one, the page is now
> owned by the page pool, and may be passed to the driver by
> going to step 3.
> 6.2 otherwise the page is released by page pool doing resoure
> cleaning like dma mapping and put_page().
>
> Aren't we dirtying the cache line of 'struct page' everytime the page
> is recycled? Or did I miss something obvious here?

What you are stating makes sense. So we would probably want to keep
the pp_frag_count at 1 when it is sitting in the pool.

> For my implementation, for a specific page without splitting, it may
> go through:
> 1. It is allocated from the page allocator.
> 2. It is initialized in page_pool_set_pp_info() and it's pp_frag_count
> set to one.
> 3. It is passed to driver by page pool.
> 4. It is passed to stack by the driver.
> 5. It is passed back to the page pool by the stack, depending on the
> page_ref_count() checking:
> 5.1 pp_frag_count and page_ref_count() being one, the page is now
> owned by the page pool, and as the optimization by not updating
> page->pp_frag_count in page_pool_defrag_page() for the last
> frag user, it can be passed to the driver by going to step 3
> without resetting the pp_frag_count to 1, which may dirty
> the cache line of 'struct page'.
> 5.2 otherwise the page is released by page pool doing resoure
> cleaning like dma mapping and put_page().
>
> Does it make any sense, or it doesn't really matter we are dirtying
> the cache line of 'struct page' whenever a page without splitted is
> recycled?

No, that makes sense. No point in dirtying a cache line if we don't have to.

> >>
> >> Does 'defragging a fragmented page' mean doing decrementing pp_frag_count?
> >> "freeing it" mean calling put_page()? What does 'combined' really means
> >> here?
> >
> > The change is that the code would do the subtraction and if it hit 0
> > it was freeing the page. That is the one piece that gets more
> > complicated because we really should be hitting 1. So we may be adding
> > a few more operations to that case.
> >
> >>>
>
> I am not sure I understand it. Does 'gets more complicated' means doing
> some optimization like page_pool_defrag_page() does?
> https://elixir.bootlin.com/linux/v6.4-rc5/source/include/net/page_pool.h#L314

Our standard case is to decrement by 1. We will need to for the code
that is doing your step 5.1 to handle a case where we are removing
multiple frag references. That is what I was getting at with the "more
complicated" comment. Basically if we push it to 0 then we either have
to free or recycle the page by resetting the fragments.

> >
> >>>
> >>> With that you could then let drivers like the Mellanox one handle its
> >>> own fragmenting knowing it has to return things to a mono-frag in
> >>> order for it to be working correctly.
> >>
> >> I still really don't how it will be better for mlx5 to handle its
> >> own fragmenting yet?
> >>
> >> +cc Dragos & Saeed to share some info here, so that we can see
> >> if page pool learn from it.
> >
> > It has more to do with the fact that the driver knows what it is going
> > to do beforehand. In many cases it can look at the page and know that
> > it isn't going to reuse it again so it can just report the truesize
> > being the length from the current pointer to the end of the page.
> >
> > You can think of it as the performance advantage of a purpose built
> > ASIC versus a general purpose CPU. The fact is we are able to cut out
> > much of the unnecessary overhead if we know exactly how we are going
> > to use the memory in the driver versus having to guess at it in the
> > page pool API.
>
> In general, I would agree with that.
> But for the specific case with mlx5, I am not sure about that, that's
> why I am curious about what is the exact reason about it doing the
> complicated frag_count handing in the driver instead of improving
> the page pool to support it's usecase, if it is about the last frag
> truesize problem here, we can do something like virtio_net do in the
> page pool too.

I suspect it has to do with their hardware doing the fragmentation of
the page. As I recall some of the Mellanox parts support Rx packing so
it is likely that their hardware is packing multiple frames into a
single page and so they are marking it pre-fragmented, and then when
the hardware completes the DMA they go through and record the offsets
for the individual fragments and pass them up the stack.