Re: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible

From: Alexander Lobakin
Date: Fri Jun 30 2023 - 11:35:36 EST


From: Alexander Duyck <alexander.duyck@xxxxxxxxx>
Date: Fri, 30 Jun 2023 07:44:45 -0700

> On Fri, Jun 30, 2023 at 5:30 AM Alexander Lobakin
> <aleksander.lobakin@xxxxxxxxx> wrote:
>>
>> From: Alexander H Duyck <alexander.duyck@xxxxxxxxx>
>> Date: Thu, 29 Jun 2023 09:45:26 -0700
>>
>>> On Thu, 2023-06-29 at 17:23 +0200, Alexander Lobakin wrote:
>>>> Turned out page_pool_put{,_full}_page() can burn quite a bunch of cycles
>>>> even when on DMA-coherent platforms (like x86) with no active IOMMU or
>>>> swiotlb, just for the call ladder.
>>>> Indeed, it's
>>
>> [...]
>>
>>>> @@ -341,6 +345,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>>
>>>> page_pool_set_dma_addr(page, dma);
>>>>
>>>> + if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
>>>> + dma_need_sync(pool->p.dev, dma)) {
>>>> + pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
>>>> + pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
>>>> + }
>>>> +
>>>> if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>>> page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>>>>
>>>
>>> I am pretty sure the logic is flawed here. The problem is
>>> dma_needs_sync depends on the DMA address being used. In the worst case
>>> scenario we could have a device that has something like a 32b DMA
>>> address space on a system with over 4GB of memory. In such a case the
>>> higher addresses would need to be synced because they will go off to a
>>> swiotlb bounce buffer while the lower addresses wouldn't.
>>>
>>> If you were to store a flag like this it would have to be generated per
>>> page.
>>
>> I know when DMA might need syncing :D That's the point of this shortcut:
>> if at least one page needs syncing, I disable it for the whole pool.
>> It's a "better safe than sorry".
>> Using a per-page flag involves more changes and might hurt some
>> scenarios/setups. For example, non-coherent systems, where you always
>> need to do syncs. The idea was to give some improvement when possible,
>> otherwise just fallback to what we have today.
>
> I am not a fan of having the page pool force the syncing either. Last
> I knew I thought the PP_FLAG_DMA_SYNC_DEV was meant to be set by the

Please follow the logics of the patch.

1. The driver sets DMA_SYNC_DEV.
2. PP tries to shortcut and replaces it with MAYBE_SYNC.
3. If dma_need_sync() returns true for some page, it gets replaced back
to DMA_SYNC_DEV, no further dma_need_sync() calls for that pool.

OR

1. The driver doesn't set DMA_SYNC_DEV.
2. PP doesn't turn on MAYBE_SYNC.
3. No dma_need_sync() tests.

Where does PP force syncs for drivers which don't need them?

> driver, not by the page pool API itself. The big reason for that being
> that the driver in many cases will have to take care of the DMA sync
> itself instead of letting the allocator take care of it.
>
> Basically we are just trading off the dma_need_sync cost versus the
> page_pool_dma_sync_for_device cost. If we think it is a win to call

dma_need_sync() is called once per newly allocated and mapped page.
page_pool_dma_sync_for_device() is called each time you ask for a page.

On my setup and with patch #4, I have literally 0 allocations once a
ring is filled. This means dma_need_sync() is not called at all during
Rx, while sync_for_device() would be called all the time.
When pages go through ptr_ring, sometimes new allocations happen, but
still the number of times dma_need_sync() is called is thousands times
lower.

> dma_need_sync for every frame then maybe we should look at folding it
> into page_pool_dma_sync_for_device itself since that is the only
> consumer of it it or just fold it into the PP_FLAG_DMA_SYNC_DEV if
> statement after the flag check rather than adding yet another flag
> that will likely always be true for most devices. Otherwise you are

What you suggest is either calling dma_need_sync() each time a page is
requested or introducing a flag to store it somewhere in struct page to
allow some optimization for really-not-common-cases when dma_need_sync()
might return different values due to swiotlb etc. Did I get it right?

> just adding overhead for the non-exception case and devices that don't
> bother setting PP_FLAG_DMA_SYNC_DEV.

Thanks,
Olek