Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)

From: Yunsheng Lin
Date: Wed Jul 12 2023 - 07:13:15 EST

Next message: Mark Brown: "[PATCH v2] PCI: dwc: Provide deinit callback for i.MX"
Previous message: Manivannan Sadhasivam: "Re: [PATCH 00/14] UFS: Add OPP and interconnect support"
In reply to: Alexander Lobakin: "Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2023/7/12 0:37, Alexander Lobakin wrote:
> From: Yunsheng Lin <linyunsheng@xxxxxxxxxx>
> Date: Tue, 11 Jul 2023 19:39:28 +0800
>
>> On 2023/7/10 21:25, Alexander Lobakin wrote:
>>> From: Yunsheng Lin <yunshenglin0825@xxxxxxxxx>
>>> Date: Sun, 9 Jul 2023 13:16:33 +0800
>>>
>>>> On 2023/7/7 0:28, Alexander Lobakin wrote:
>>>>> From: Yunsheng Lin <linyunsheng@xxxxxxxxxx>
>>>>> Date: Thu, 6 Jul 2023 20:47:28 +0800
>>>>>
>>>>>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>>>>>
>>>>>>> +/**
>>>>>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>>>>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>>>>>> + * @size: size of the PP, usually simply Rx queue len
>>>>>>> + *
>>>>>>> + * Returns &page_pool on success, casted -errno on failure.
>>>>>>> + */
>>>>>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>>>>>> + u32 size)
>>>>>>> +{
>>>>>>> + struct page_pool_params pp = {
>>>>>>> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>>>>> + .order = LIBIE_RX_PAGE_ORDER,
>>>>>>> + .pool_size = size,
>>>>>>> + .nid = NUMA_NO_NODE,
>>>>>>> + .dev = napi->dev->dev.parent,
>>>>>>> + .napi = napi,
>>>>>>> + .dma_dir = DMA_FROM_DEVICE,
>>>>>>> + .offset = LIBIE_SKB_HEADROOM,
>>>>>>
>>>>>> I think it worth mentioning that the '.offset' is not really accurate
>>>>>> when the page is split, as we do not really know what is the offset of
>>>>>> the frag of a page except for the first frag.
>>>>>
>>>>> Yeah, this is read as "offset from the start of the page or frag to the
>>>>> actual frame start, i.e. its Ethernet header" or "this is just
>>>>> xdp->data - xdp->data_hard_start".
>>>>
>>>> So the problem seems to be if most of drivers have a similar reading as
>>>> libie does here, as .offset seems to have a clear semantics which is used
>>>> to skip dma sync operation for buffer range that is not touched by the
>>>> dma operation. Even if it happens to have the same value of "offset from
>>>> the start of the page or frag to the actual frame start", I am not sure
>>>> it is future-proofing to reuse it.
>>>
>>> Not sure I'm following :s
>>
>> It would be better to avoid accessing the internal data of the page pool
>> directly as much as possible, as that may be changed to different meaning
>> or removed if the implememtation is changed.
>>
>> If it is common enough that most drivers are using it the same way, adding
>> a helper for that would be great.
>
> How comes page_pool_params is internal if it's defined purely by the
> driver and then exists read-only :D I even got warned in the adjacent
> thread that the Page Pool core code shouldn't change it anyhow.

Personally I am not one hundred percent convinced that page_pool_params
will not be changed considering the discussion about improving/replacing
the page pool to support P2P case.

>
>>
>>>
>>>>
>>>> When page frag is added, I didn't really give much thought about that as
>>>> we use it in a cache coherent system.
>>>> It seems we might need to extend or update that semantics if we really want
>>>> to skip dma sync operation for all the buffer ranges that are not touched
>>>> by the dma operation for page split case.
>>>> Or Skipping dma sync operation for all untouched ranges might not be worth
>>>> the effort, because it might need a per frag dma sync operation, which is
>>>> more costly than a batched per page dma sync operation. If it is true, page
>>>> pool already support that currently as my understanding, because the dma
>>>> sync operation is only done when the last frag is released/freed.
>>>>
>>>>>
>>>>>>
>>>>>>> + };
>>>>>>> + size_t truesize;
>>>>>>> +
>>>>>>> + pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
>>>>
>>>> As mentioned above, if we depend on the last released/freed frag to do the
>>>> dma sync, the pp.max_len might need to cover all the frag.
>>>
>>> ^^^^^^^^^^^^
>>>
>>> You mean the whole page or...?
>>
>> If we don't care about the accurate dma syncing, "cover all the frag" means
>> the whole page here, as page pool doesn't have enough info to do accurate
>> dma sync for now.
>>
>>> I think it's not the driver's duty to track all this. We always set
>>> .offset to `data - data_hard_start` and .max_len to the maximum
>>> HW-writeable length for one frame. We don't know whether PP will give us
>>> a whole page or just a piece. DMA sync for device is performed in the PP
>>> core code as well. Driver just creates a PP and don't care about the
>>> internals.
>>
>> There problem is that when page_pool_put_page() is called with a split
>> page, the page pool does not know which frag is freeing too.
>>
>> setting 'maximum HW-writeable length for one frame' only sync the first
>> frag of a page as below:
>
> Maybe Page Pool should synchronize DMA even when !last_frag then?
> Setting .max_len to anything bigger than the maximum frame size you're
> planning to receive is counter-intuitive.

This is simplest way to support dma sync for page frag case, the question
is if the batching of the dma sync for all frag of a page can even out the
additional cost of dma sync for dma untouched range of a page.

> All three xdp_buff, xdp_frame and skb always have all info needed to
> determine which piece of the page we're recycling, it should be possible
> to do with no complications. Hypothetical forcing drivers to do DMA
> syncs on their own when they use frags is counter-intuitive as well,
> Page Pool should be able to handle this itself.
>
> Alternatively, Page Pool may do as follows:
>
> 1. !last_frag -- do nothing, same as today.
> 2. last_frag -- sync, but not [offset, offset + max_len), but
> [offset, PAGE_SIZE).

When a frag is free, we don't know if it is the last frag or not for
now yet.

>
> This would also cover non-HW-writeable pieces like 2th-nth frame's
> headroom and each frame's skb_shared_info, but it's the only alternative
> to syncing each frag separately.
> Yes, it's almost the same as to set .max_len to %PAGE_SIZE, but as I
> said, it feels weird to set .max_len to 4k when you allocate 2k frags.
> You don't know anyway how much of a page will be used.

In that that case, we may need to make it more generic for the case when
a page is spilt into more than two frags, especially for system with 64K
page size.

Next message: Mark Brown: "[PATCH v2] PCI: dwc: Provide deinit callback for i.MX"
Previous message: Manivannan Sadhasivam: "Re: [PATCH 00/14] UFS: Add OPP and interconnect support"
In reply to: Alexander Lobakin: "Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]