Re: [PATCH net-next v2 1/3] page_pool: unify frag page and non-frag page handling

From: Yunsheng Lin
Date: Sat Jun 03 2023 - 09:10:41 EST


On 2023/6/3 0:37, Alexander H Duyck wrote:
...

>>
>> Please let me know if the above makes sense, or if misunderstood your
>> concern here.
>
> So my main concern is that what this is doing is masking things so that
> the veth and virtio_net drivers can essentially lie about the truesize
> of the memory they are using in order to improve their performance by
> misleading the socket layer about how much memory it is actually
> holding onto.
>
> We have historically had an issue with reporting the truesize of
> fragments, but generally the underestimation was kept to something less
> than 100% because pages were generally split by at least half. Where it
> would get messy is if a misbehaviing socket held onto packets for an
> exceedingly long time.
>
> What this patch set is doing is enabling explicit lying about the
> truesize, and it compounds that by allowing for mixing small
> allocations w/ large ones.
>
>>>
>>> The problem is there are some architectures where we just cannot
>>> support having pp_frag_count due to the DMA size. So it makes sense to
>>> leave those with just basic page pool instead of trying to fake that it
>>> is a fragmented page.
>>
>> It kind of depend on how you veiw it, this patch view it as only supporting
>> one frag when we can't support having pp_frag_count, so I would not call it
>> faking.
>
> So the big thing that make it "faking" is the truesize underestimation
> that will occur with these frames.

Let's discuss truesize issue in patch 2 instead of here.
Personally, I still believe that if the driver can compute the
truesize correctly by manipulating the page->pp_frag_count and
frag offset directly, the page pool can do that too.

>
>>
>>>
>>>> ---

...

>>>
>>> What is the point of this line? It doesn't make much sense to me. Are
>>> you just trying to force an optiimization? You would be better off just
>>> taking the BUILD_BUG_ON contents and feeding them into an if statement
>>> below since the statement will compile out anyway.
>>
>> if the "if statement" you said refers to the below, then yes.
>>
>>>> + if (!__builtin_constant_p(nr))
>>>> + atomic_long_set(&page->pp_frag_count, 1);
>>
>> But it is a *BUILD*_BUG_ON(), isn't it compiled out anywhere we put it?
>>
>> Will move it down anyway to avoid confusion.
>
> Actually now that I look at this more it is even more confusing. The
> whole point of this function was that we were supposed to be getting
> pp_frag_count to 0. However you are setting it to 1.
>
> This is seriously flawed. If we are going to treat non-fragmented pages
> as mono-frags then that is what we should do. We should be pulling this
> acounting into all of the page pool freeing paths, not trying to force
> the value up to 1 for the non-fragmented case.

I am not sure I understand what do you mean by 'non-fragmented ',
'mono-frags', 'page pool freeing paths' and 'non-fragmented case'
here. maybe describe it more detailed with something like the
pseudocode?

>
>>>
>>> It seems like what you would want here is:
>>> BUG_ON(!PAGE_POOL_DMA_USE_PP_FRAG_COUNT);
>>>
>>> Otherwise you are potentially writing to a variable that shouldn't
>>> exist.
>>
>> Not if the driver use the page_pool_alloc_frag() API instead of manipulating
>> the page->pp_frag_count directly using the page_pool_defrag_page() like mlx5.
>> The mlx5 call the page_pool_create() with with PP_FLAG_PAGE_FRAG set, and
>> it does not seems to have a failback for PAGE_POOL_DMA_USE_PP_FRAG_COUNT
>> case, and we may need to keep PP_FLAG_PAGE_FRAG for it. That's why we need
>> to keep the driver from implementation detail(pp_frag_count handling specifically)
>> of the frag support unless we have a very good reason.
>>
>
> Getting the truesize is that "very good reason". The fact is the
> drivers were doing this long before page pool came around. Trying to
> pull that away from them is the wrong way to go in my opinion.

If the truesize is really the concern here, I think it make more
sense to enforce it in the page pool instead of each driver doing
their trick, so I also think we can do better here to handle
pp_frag_count in the page pool instead of driver handling it, so
let's continue the truesize disscussion in patch 2 to see if we
can come up with something better there.

>
>>>> /* If nr == pp_frag_count then we have cleared all remaining
>>>> * references to the page. No need to actually overwrite it, instead
>>>> * we can leave this to be overwritten by the calling function.
>>>> @@ -311,19 +321,36 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>>>> * especially when dealing with a page that may be partitioned
>>>> * into only 2 or 3 pieces.
>>>> */
>>>> - if (atomic_long_read(&page->pp_frag_count) == nr)
>>>> + if (atomic_long_read(&page->pp_frag_count) == nr) {
>>>> + /* As we have ensured nr is always one for constant case
>>>> + * using the BUILD_BUG_ON() as above, only need to handle
>>>> + * the non-constant case here for frag count draining.
>>>> + */
>>>> + if (!__builtin_constant_p(nr))
>>>> + atomic_long_set(&page->pp_frag_count, 1);
>>>> +
>>>> return 0;
>>>> + }
>>>>
>
> The optimization here was already the comparison since we didn't have
> to do anything if pp_frag_count == nr. The whole point of pp_frag_count
> going to 0 is that is considered non-fragmented in that case and ready
> to be freed. By resetting it to 1 you are implying that there is still
> one *other* user that is holding a fragment so the page cannot be
> freed.
>
> We weren't bothering with writing the value since the page is in the
> free path and this value is going to be unused until the page is
> reallocated anyway.

I am not sure what you meant above.
But I will describe what is this patch trying to do again:
When PP_FLAG_PAGE_FRAG is set and that flag is per page pool, not per
page, so page_pool_alloc_pages() is not allowed to be called as the
page->pp_frag_count is not setup correctly for the case.

So in order to allow calling page_pool_alloc_pages(), as best as I
can think of, either we need a per page flag/bit to decide whether
to do something like dec_and_test for page->pp_frag_count in
page_pool_is_last_frag(), or we unify the page->pp_frag_count handling
in page_pool_is_last_frag() so that we don't need a per page flag/bit.

This patch utilizes the optimization you mentioned above to unify the
page->pp_frag_count handling.

>
>>>> ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>> WARN_ON(ret < 0);
>>>> +
>>>> + /* Reset frag count back to 1, this should be the rare case when
>>>> + * two users call page_pool_defrag_page() currently.
>>>> + */
>>>> + if (!ret)
>>>> + atomic_long_set(&page->pp_frag_count, 1);
>>>> +
>>>> return ret;
>>>> }
>>>>

...

>> As above, it is about unifying handling for frag and non-frag page in
>> page_pool_is_last_frag(). please let me know if there is any better way
>> to do it without adding statements here.
>
> I get what you are trying to get at but I feel like the implementation
> is going to cause more problems than it helps. The problem is it is
> going to hurt base page pool performance and it just makes the
> fragmented pages that much more confusing to deal with.

For base page pool performance, as I mentioned before:
It remove PP_FLAG_PAGE_FRAG checking and only add the cost of
page_pool_fragment_page() in page_pool_set_pp_info(), which I
think it is negligible as we are already dirtying the same cache
line in page_pool_set_pp_info().

For the confusing, sometimes it is about personal taste, so I am
not going to argue with it:) But it would be good to provide a
non-confusing way to do that with minimal overhead. I feel like
you have provided it in the begin, but I am not able to understand
it yet.

>
> My advice as a first step would be to look at first solving how to
> enable the PP_FLAG_PAGE_FRAG mode when you have
> PAGE_POOL_DMA_USE_PP_FRAG_COUNT as true. That should be creating mono-
> frags as we are calling them, and we should have a way to get the
> truesize for those so we know when we are consuming significant amount
> of memory.

Does the way to get the truesize in the below RFC make sense to you?
https://patchwork.kernel.org/project/netdevbpf/patch/20230516124801.2465-4-linyunsheng@xxxxxxxxxx/

>
> Once that is solved then we can look at what it would take to apply
> mono-frags to the standard page pool case. Ideally we would need to
> find a way to do it with minimal overhead.
>
>