Re: [PATCH net-next 6/9] page_pool: avoid calling no-op externals when possible

From: Alexander Lobakin
Date: Fri Jul 28 2023 - 10:21:34 EST


From: Yunsheng Lin <linyunsheng@xxxxxxxxxx>
Date: Fri, 28 Jul 2023 20:39:24 +0800

> On 2023/7/27 22:43, Alexander Lobakin wrote:
>> Turned out page_pool_put{,_full}_page() can burn quite a bunch of cycles
>> even when on DMA-coherent platforms (like x86) with no active IOMMU or
>> swiotlb, just for the call ladder.
>> Indeed, it's
>>
>> page_pool_put_page()
>> page_pool_put_defragged_page() <- external
>> __page_pool_put_page()
>> page_pool_dma_sync_for_device() <- non-inline
>> dma_sync_single_range_for_device()
>> dma_sync_single_for_device() <- external
>> dma_direct_sync_single_for_device()
>> dev_is_dma_coherent() <- exit
>>
>> For the inline functions, no guarantees the compiler won't uninline them
>> (they're clearly not one-liners and sometimes compilers uninline even
>> 2 + 2). The first external call is necessary, but the rest 2+ are done
>> for nothing each time, plus a bunch of checks here and there.
>> Since Page Pool mappings are long-term and for one "device + addr" pair
>> dma_need_sync() will always return the same value (basically, whether it
>> belongs to an swiotlb pool), addresses can be tested once right after
>> they're obtained and the result can be reused until the page is unmapped.
>> Define the new PP DMA sync operation type, which will mean "do DMA syncs
>> for the device, but only when needed" and turn it on by default when the
>> driver asks to sync pages. When a page is mapped, check whether it needs
>> syncs and if so, replace that "sync when needed" back to "always do
>> syncs" globally for the whole pool (better safe than sorry). As long as
>> the pool has no pages requiring DMA syncs, this cuts off a good piece
>> of calls and checks. When at least one page required it, the pool
>> conservatively falls back to "always call sync functions", no per-page
>> verdicts. It's a fairly rare case anyway that only a few pages would
>> require syncing.
>> On my x86_64, this gives from 2% to 5% performance benefit with no
>> negative impact for cases when IOMMU is on and the shortcut can't be
>> used.
>>
>
> It seems other subsystem may have the similar problem as page_pool,
> is it possible to implement this kind of trick in the dma subsystem
> instead of every subsystem inventing their own trick?

In the ladder I described above most of overhead comes from jumping
between Page Pool functions, not the generic DMA ones. Let's say I do
this shortcut in dma_sync_single_range_for_device(), that is too late
already to count on some good CPU saves.
Plus, DMA sync API operates with dma_addr_t, not struct page. IOW it's
not clear to me where to store this "we can shortcut" bit in that case.

>From "other subsystem" I remember only XDP sockets. There, they also
avoid calling their own non-inline functions in the first place, not the
generic DMA ones. So I'd say both cases (PP and XSk) can't be solved via
some "generic" solution.

Thanks,
Olek