Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

From: Alexander Duyck
Date: Fri Jun 16 2023 - 15:51:14 EST


On Fri, Jun 16, 2023 at 11:41 AM Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx> wrote:
>
>
>
> On 16/06/2023 19.34, Alexander Duyck wrote:
> > On Fri, Jun 16, 2023 at 9:31 AM Jesper Dangaard Brouer
> > <jbrouer@xxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 16/06/2023 13.57, Yunsheng Lin wrote:
> >>> On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:

[...]

>
>
> >>> if (!nskb) {
> >>> page_pool_put_full_page(rq->page_pool, page, true);
> >>> goto drop;
> >>> @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >>> skb_copy_header(nskb, skb);
> >>> skb_mark_for_recycle(nskb);
> >>>
> >>> - size = min_t(u32, skb->len, max_head_size);
> >>> if (skb_copy_bits(skb, 0, nskb->data, size)) {
> >>> consume_skb(nskb);
> >>> goto drop;
> >>> @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >>> len = skb->len - off;
> >>>
> >>> for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
> >>> - page = page_pool_dev_alloc_pages(rq->page_pool);
> >>> + size = min_t(u32, len, PAGE_SIZE);
> >>> + truesize = size;
> >>> +
> >>> + page = page_pool_dev_alloc(rq->page_pool, &page_offset,
> >>> + &truesize);
> >>> if (!page) {
> >>> consume_skb(nskb);
> >>> goto drop;
> >>> }
> >>>
> >>> - size = min_t(u32, len, PAGE_SIZE);
> >>> - skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
> >>> + skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);
> >>
> >> I guess this shows the opposite: that the "page" _is_ used by the
> >> existing API.
> >
> > This is sort of true. One thing that has come up as of late is that
> > all this stuff is being moved over to folios anyway, getting away
> > from pages. In addition, I am not sure how often we have to take
> > this path, as it isn't clear how many non-Tx frames end up needing
> > fragments added to them. For something like veth it might be more
> > common though, since Tx becomes Rx in this case.
>
> I'm thinking that it is very unlikely that XDP has modified the
> fragments. So why are we allocating and copying the fragments?
> Wouldn't it be possible for this veth code to bump the refcnt on these
> fragments instead? (Maybe I missed some detail.)

From what I can tell this is an exception case with multiple caveats
for shared, locked, or nonlinear frames.

As such, I suspect there may be some deprecated cases in there too:
XDP multi-buf support has been around for a while now, so the code
shouldn't need to reallocate just to linearize a nonlinear frame.
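
For reference, the guard that sends us down this copy path looks
roughly like this (quoting from memory; the exact condition in
veth_convert_skb_to_xdp_buff() may differ slightly):

	/* the copy/reallocate path is only taken for these cases */
	if (skb_shared(skb) || skb_head_is_locked(skb) ||
	    skb_shinfo(skb)->nr_frags ||
	    skb_headroom(skb) < XDP_PACKET_HEADROOM) {
		/* ... allocate nskb from rq->page_pool and copy ... */
	}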

> >
> > One other question I have now that I look at this code as well. Why is
> > it using page_pool and not just a frag cache allocator, or pages
> > themselves? It doesn't seem like it has a DMA mapping to deal with
> > since this is essentially copy-break code. Seems problematic that
> > there is no DMA involved here at all. This could be more easily
> > handled with just a single page_frag_cache style allocator.
> >
>
> Yes, precisely.
> I distinctly remember that I tried to poke you and Eric about this
> approach earlier, but I cannot find a link to that email.
>
> I would really appreciate it if you, Alex, could give the approach in
> veth_convert_skb_to_xdp_buff() some review, as I believe there is huge
> potential for improvement there that will lead to large performance
> gains. (I'm sure Maryam will be eager to help re-test performance for
> her use-cases.)

Well, just looking at it, the quick and dirty answer would be to make
use of something like page_frag_cache. I won't go into details since it
isn't too different from the page_pool frag allocator, but it is much
simpler: it just does reference-count tricks instead of taking on the
extra overhead of keeping a DMA mapping in place. The veth would then
just be sitting on at most an order-3 page while it waits to fully
consume it, rather than holding a full pool of pages.
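
To sketch what I mean for the frag loop (untested, and assuming a
hypothetical per-ring "struct page_frag_cache frag_cache" hanging off
the rq in place of the page_pool; nskb would also no longer be marked
for page_pool recycling):

	void *va;

	for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
		size = min_t(u32, len, PAGE_SIZE);

		/* carve a chunk out of the cached page (at most order-3);
		 * no DMA mapping to maintain, just page refcounting
		 */
		va = page_frag_alloc(&rq->frag_cache, size, GFP_ATOMIC);
		if (!va) {
			consume_skb(nskb);
			goto drop;
		}

		page = virt_to_head_page(va);
		skb_add_rx_frag(nskb, i, page, va - page_address(page),
				size, size);

		if (skb_copy_bits(skb, off, va, size)) {
			consume_skb(nskb);
			goto drop;
		}

		len -= size;
		off += size;
	}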

Alternatively, it could do something similar to page_frag_alloc_align
itself and bypass a custom allocator entirely. If it went that route it
could treat the memory almost like a ring buffer and greatly improve
throughput, since it would be able to allocate a higher-order page and
copy the entire skb in so the whole thing ends up linear, rather than
having to allocate a bunch of single pages.
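
Something along these lines, for illustration (equally untested;
VETH_XDP_HEADROOM is the existing headroom define in veth.c, and this
assumes the frame fits within the cache's chunk, which is at most
order-3/32KB, so larger frames would still need the frag loop above):

	/* one linear chunk: headroom + full packet + shared info */
	size = VETH_XDP_HEADROOM + skb->len +
	       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

	va = page_frag_alloc(&rq->frag_cache, size, GFP_ATOMIC);
	if (!va)
		goto drop;

	nskb = build_skb(va, size);
	if (!nskb) {
		page_frag_free(va);
		goto drop;
	}

	skb_reserve(nskb, VETH_XDP_HEADROOM);

	/* copy the whole skb, head and frags, into the linear area */
	if (skb_copy_bits(skb, 0, skb_put(nskb, skb->len), skb->len)) {
		consume_skb(nskb);
		goto drop;
	}

	skb_copy_header(nskb, skb);
	consume_skb(skb);
	skb = nskb;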