Re: [PATCH net-next] page_pool: Clamp ring size to 32K

From: Alexander Lobakin
Date: Tue Aug 08 2023 - 14:45:22 EST


From: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
Date: Mon, 7 Aug 2023 22:11:35 +0200

>
>
> On 07/08/2023 19.20, Jakub Kicinski wrote:
>> On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote:
>>>> Page pool (PP) is just a cache of pages.  The driver octeontx2 (in
>>>> link) is creating an excessively large cache of pages.  The driver's
>>>> RX descriptor ring size should be independent of the PP ptr_ring
>>>> size, as the latter is just a cache that grows as a function of the
>>>> in-flight packet workload; it functions as a "shock absorber".
>>>>
>>>> 32768 pages (4 KiB each) is approx. 128 MiB, and this will be
>>>> per RX-queue.
>>>>
>>>> The RX-desc ring (obviously) pins down these pages (immediately),
>>>> but the PP ring starts empty.  As the workload varies, the "shock
>>>> absorber" effect will let more pages into the system, and those will
>>>> travel through the PP ptr_ring.  Since all pages originating from
>>>> the same PP instance get recycled, the number of in-flight pages in
>>>> the "system" (PP ptr_ring) will grow over time.
>>>>
>>>> The PP design has the problem that it never releases or reduces the
>>>> pages in this shock absorber "closed" system.  (Cc PP people/devel)
>>>> We should consider implementing an MM shrinker callback
>>>> (include/linux/shrinker.h).
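
(To illustrate the idea: a rough, untested sketch of what such a
shrinker hookup could look like.  pp_idle_page_count() and
pp_release_idle_pages() are made-up helpers here, not existing
page_pool functions.)

static unsigned long pp_shrink_count(struct shrinker *s,
                                     struct shrink_control *sc)
{
        /* e.g. how many idle pages currently sit in PP ptr_ring caches */
        return pp_idle_page_count();
}

static unsigned long pp_shrink_scan(struct shrinker *s,
                                    struct shrink_control *sc)
{
        /* release up to sc->nr_to_scan idle pages back to the page allocator */
        return pp_release_idle_pages(sc->nr_to_scan);
}

static struct shrinker pp_shrinker = {
        .count_objects  = pp_shrink_count,
        .scan_objects   = pp_shrink_scan,
        .seeks          = DEFAULT_SEEKS,
};

/* e.g. from a page_pool init path */
register_shrinker(&pp_shrinker, "page_pool");
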
>>>>
>>>> Are the systems using the octeontx2 driver ready to handle 128 MiB
>>>> of memory per RX-queue getting pinned down over time?  (This could
>>>> lead to some strange-to-debug situations if the memory is not
>>>> sufficient.)
>>>
>>> I'm with Jesper on this. It doesn't make sense to be tying the
>>> page_pool size strictly to the ring size. The amount of recycling you
>>> get will depend on how long the packets are on the stack, not in the
>>> driver.
>>>
>
> Thanks for agreeing with me, and I agree with you :-)
>
>>> For example, in the case of something like a software router or bridge
>>> that is just taking the Rx packets and routing them to Tx, you could
>>> theoretically get away with a multiple of NAPI_POLL_WEIGHT, since you
>>> would likely never need much more than that, as the Tx would likely be
>>> cleaned about as fast as the Rx can consume the pages.
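
(To put a number on that: NAPI_POLL_WEIGHT is 64, so for a pure
forwarding workload something in this ballpark would follow from the
above -- the multiplier 4 below is an arbitrary example, not a
recommendation:)

/* forwarding-only workload: a small PP cache is usually enough */
pp_params.pool_size = 4 * NAPI_POLL_WEIGHT;     /* 4 * 64 = 256 entries */
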
>>>
>
> I agree.
>
>>> Rather than overriding the size here, wouldn't it make more sense to
>>> do it in the octeontx2 driver?  With that, at least you would know
>>> that you were the one that limited the size instead of having the
>>> value modified out from underneath you.
>>>
>
> I'm not fully agreeing here.  I don't think we can expect driver
> developers to be experts on page_pool cache dynamics.  I'm more on
> Jakub's side here, as perhaps we/net-core can come up with some control
> system, even if this means we change this underneath the drivers.
>
>
>>> That said, one change that might help to enable this kind of change
>>> would be to add a #define, so that this value wouldn't be such a
>>> magic number and would be visible to the drivers should it ever be
>>> changed in the future.
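
(I assume something along these lines -- PAGE_POOL_MAX_RING_SIZE is a
hypothetical name here, not an existing define:)

/* hypothetical define, e.g. next to the other page_pool limits */
#define PAGE_POOL_MAX_RING_SIZE 32768

/* a driver could then clamp explicitly instead of being surprised: */
pp_params.pool_size = min_t(u32, rx_ring_size, PAGE_POOL_MAX_RING_SIZE);
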
>>
>> All the points y'all are making are valid; sizing the cache is a hard
>> problem.  But the proposed solution goes in the wrong direction, IMO.
>> The driver doesn't know. I started hacking together page pool control
>> over netlink. I think that the pool size selection logic should be in
>> the core, with inputs taken from user space / workload (via netlink).
>>
>> If it wasn't for the fact that I'm working on that API I'd probably
>> side with you. And 64k descriptors is impractically large.
>>
>> Copy / pasting from the discussion on previous version:
>>
>>    Tuning this in the driver relies on the assumption that the HW /
>>    driver is the thing that matters. I'd think that the workload,
>>    platform (CPU) and config (e.g. is IOMMU enabled?) will matter at
>>    least as much.  Meanwhile, driver developers will end up tuning to
>>    whatever servers they have, a random single config, and most
>>    likely... iperf.
>>
>>    IMO it's much better to re-purpose "pool_size" and treat it as the
>>    ring size, because that's what most drivers end up putting there.
>
> I disagree here, as driver developers should not treat "pool_size" as
> the ring size.  That seems to be a copy-paste programming pattern done
> without understanding PP dynamics.

+1. That's why I wrote in the previous thread that pool_size must be the
minimum value which gives optimal performance. I don't believe Otx2 HW
needs 32k entries in PP's ptr_ring to get optimal performance.
That's why I wrote that developers must check whether there's any
benefit in using bigger pool_size values. Values bigger than 2k don't
seem reasonable to me, especially now that we use direct recycling way
more aggressively -- oftentimes the ptr_ring is not used at all.
Jakub thought that my "pls test whether bigger sizes make sense" meant
"please tune Page Pool to your servers", which is not exactly what I
wanted to say =\ I only said "please keep pool_size reasonable; it's
your right to have 2^32 descriptors on the ring, but don't do that with
Page Pool".
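
To make that concrete, I mean a setup roughly like this (untested
sketch; the 2048 cap and the exact field values are only illustrative,
not a recommendation for any particular HW):

struct page_pool_params pp_params = {
        .order          = 0,
        .flags          = PP_FLAG_DMA_MAP,
        /* cache size decoupled from the RX descriptor ring size */
        .pool_size      = min_t(u32, rx_ring_size, 2048),
        .nid            = NUMA_NO_NODE,
        .dev            = dev,          /* device used for DMA mapping */
        .dma_dir        = DMA_FROM_DEVICE,
};

pool = page_pool_create(&pp_params);
if (IS_ERR(pool))
        return PTR_ERR(pool);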

>
>>    Defer tuning of the effective ring size to the core and user input
>>    (via the "it will be added any minute now" netlink API for configuring
>>    page pools)...
>>
>
> I agree here that tuning the ring size is a hard problem and that this
> is better handled in the core.  Happy to hear it, if Jakub is indeed
> working on this.
>
>>    So capping the recycle ring to 32k instead of returning the error
>>    seems like an okay solution for now.
>
> As a temporary solution, I'm actually fine with capping at 32k.
> Driver developers lose some feedback control, but perhaps that is okay
> if we can agree that the net-core should control tuning this anyhow.
>
> --Jesper
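
FWIW, the cap itself would then be just a couple of lines in
page_pool_init(), roughly (sketch only, not the actual patch):

        /* clamp the requested size instead of failing with -E2BIG */
        if (pool->p.pool_size > 32768)
                pool->p.pool_size = 32768;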

Thanks,
Olek