Re: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible

From: Alexander Lobakin
Date: Mon Jul 03 2023 - 09:39:16 EST


From: Alexander Duyck <alexander.duyck@xxxxxxxxx>
Date: Fri, 30 Jun 2023 11:28:30 -0700

> On Fri, Jun 30, 2023 at 8:34 AM Alexander Lobakin
> <aleksander.lobakin@xxxxxxxxx> wrote:

[...]

>> On my setup and with patch #4, I have literally 0 allocations once a
>> ring is filled. This means dma_need_sync() is not called at all during
>> Rx, while sync_for_device() would be called all the time.
>> When pages go through ptr_ring, sometimes new allocations happen, but
>> still the number of times dma_need_sync() is called is thousands times
>> lower.
>
> I see, so you are using it as a screener for pages as they are added
> to the pool. However the first time somebody trips the dma_need_sync
> then everybody in the pool is going to be getting hit with the sync
> code.

Right. "Better safe than sorry". If at least one page needs sync, let's
drop the shortcut. It won't make it worse than the mainline anyway.

>
>>> dma_need_sync for every frame then maybe we should look at folding it
>>> into page_pool_dma_sync_for_device itself since that is the only
>>> consumer of it it or just fold it into the PP_FLAG_DMA_SYNC_DEV if
>>> statement after the flag check rather than adding yet another flag
>>> that will likely always be true for most devices. Otherwise you are
>>
>> What you suggest is either calling dma_need_sync() each time a page is
>> requested or introducing a flag to store it somewhere in struct page to
>> allow some optimization for really-not-common-cases when dma_need_sync()
>> might return different values due to swiotlb etc. Did I get it right?
>
> Yeah, my thought would be to have a flag in the page to indicate if it
> will need the sync bits or not. Then you could look at exposing that
> to the drivers as well so that they could cut down on their own
> overhead. We could probably look at just embedding a flag in the lower
> bits of the DMA address stored in the page since I suspect at a
> minimum the resultant DMA address for a page would always be at least
> aligned to a long if not a full page.

As for drivers, I could add a wrapper like page_pool_need_sync() to test
for DMA_SYNC_DEV. I'm not sure it's worth it to check on a per-page basis.
Also, having that bit in the struct page forces us to always fetch it to
a cacheline. Right now, if the sync is skipped, this also avoids
touching struct page or at least postpones it. page_address() (for
&xdp_buff or build_skb()) doesn't touch it.

As for possible implementation, I also thought of the lowest bit of DMA
address. It probably can be lower than %PAGE_SIZE in some cases (at
least non-PP), but not to the 1-byte granularity.
But again, we need some group decision on if it's worth to do on a
per-page basis :)

Thanks,
Olek