Re: [RFC PATCH v3 07/12] page-pool: device memory support

From: Mina Almasry
Date: Thu Nov 09 2023 - 07:20:30 EST


On Thu, Nov 9, 2023 at 1:30 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
>
> On 2023/11/9 11:20, Mina Almasry wrote:
> > On Wed, Nov 8, 2023 at 2:56 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
>
> >
> > Agreed everything above is undoable.
> >
> >> But we might be able to do something as folio is doing now, mm subsystem
> >> is still seeing 'struct folio/page', but other subsystem like slab is using
> >> 'struct slab', and there is still some common fields shared between
> >> 'struct folio' and 'struct slab'.
> >>
> >
> > In my eyes this is almost exactly what I suggested in RFC v1 and got
> > immediately nacked with no room to negotiate. What we did for v1 is to
> > allocate struct pages for dma-buf to make dma-bufs look like struct
> > page to mm subsystem. Almost exactly what you're describing above.
>
> Maybe the above is where we have disagreement:
> Do we still need make dma-bufs look like struct page to mm subsystem?
> IMHO, the answer is no. We might only need to make dma-bufs look like
> struct page to net stack and page pool subsystem. I think that is already
> what this pacthset is trying to do, what I am suggesting is just make
> it more like 'struct page' to net stack and page pool subsystem, in order
> to try to avoid most of the 'if' checking in net stack and page pool
> subsystem.
>

First, most of the checking in the net stack is
skb_frag_not_readable(). dma-buf are fundamentally not kmap()able and
not readable. So we can't remove those, no matter what we do I think.
Can we agree on that? If so, lets discuss removing most of the ifs in
the page pool, only.

> > It's a no-go. I don't think renaming struct page to netmem is going to
> > move the needle (it also re-introduces code-churn). What I feel like I
> > learnt is that dma-bufs are not struct pages and can't be made to look
> > like one, I think.
> >
> >> As the netmem patchset, is devmem able to reuse the below 'struct netmem'
> >> and rename it to 'struct page_pool_iov'?
> >
> > I don't think so. For the reasons above, but also practically it
> > immediately falls apart. Consider this field in netmem:
> >
> > + * @flags: The same as the page flags. Do not use directly.
> >
> > dma-buf don't have or support page-flags, and making dma-buf looks
> > like they support page flags or any page-like features (other than
> > dma_addr) seems extremely unacceptable to mm folks.
>
> As far as I tell, as we limit the devmem usage in netstack, the below
> is the related mm function call for 'struct page' for devmem:
> page_ref_*(): page->_refcount does not need changing

Sorry, I don't understand. Are you suggesting we call page_ref_add() &
page_ref_sub() on page_pool_iov? That is basically making
page_pool_iov look like struct page to the mm stack, since page_ref_*
are mm calls, which you say above we don't need to do. We will still
need to special case this, no?

> page_is_pfmemalloc(): which is corresponding to page->pp_magic, and
> devmem provider can set/unset it in it's 'alloc_pages'
> ops.

page_is_pfmemalloc() has nothing to do with page->pp_magic. It checks
page->lru.next to figure out if this is a pfmemalloc. page_pool_iov
has no page->lru.next. Still need to special case this?

> page_to_nid(): we may need to handle it differently somewhat like this
> patch does as page_to_nid() may has different implementation
> based on different configuration.

So you're saying we need to handle page_to_nid() differently for
devmem? So we're not going to be able to avoid the if statement.

> page_pool_iov_put_many(): as mentioned in other thread, if net stack is not
> calling page_pool_page_put_many() directly, we
> can reuse napi_pp_put_page() for devmem too, and
> handle the special case for devmem in 'release_page'
> ops.
>

page_pool_iov_put_many()/page_pool_iov_get_many() are called to do
refcounting before the page is released back to the provider. I'm not
seeing how we can handle the special case inside of 'release_page' -
that's too late, as far as I can tell.

The only way to remove the if statements in the page pool is to
implement what you said was not feasible in an earlier email. We would
define this struct:

struct netmem {
/* common fields */
refcount_t refcount;
bool is_pfmemalloc;
int nid;
......
union {
struct devmem{
struct dmabuf_genpool_chunk_owner *owner;
};

struct page * page;
};
};

Then, we would require all memory providers to allocate struct netmem
for the memory and set the common fields, including ones that have
struct pages. For devmem, netmem->page will be NULL, because netmem
has no page.

If we do that, the page pool can ignore whether the underlying memory
is page or devmem, because it can use the common fields, example:

/* page_ref_count replacement */
netmem_ref_count(struct netmem* netmem) {
return netmem->refcount;
}

/* page_ref_add replacement */
netmem_ref_add(struct netmem* netmem) {
atomic_inc(netmem->refcount);
}

/* page_to_nid replacement */
netmem_nid(struct netmem* netmem) {
return netmem->nid;
}

/* page_is_pfmemalloc() replacement */
netmem_is_pfmemalloc(struct netmem* netmem) {
return netmem->is_pfmemalloc;
}

/* page_ref_sub replacement */
netmem_ref_sub(struct netmem* netmem) {
atomic_sub(netmet->refcount);
if (netmem->refcount == 0) {
/* release page to the memory provider.
* struct page memory provider will do put_page(),
* devmem will do something else */
}
}
}


I think this MAY BE technically feasible, but I'm not sure it's better:

1. It is a huge refactor to the page pool, lots of code churn. While
the page pool currently uses page*, it needs to be completely
refactored to use netmem*.
2. It causes extra memory usage. struct netmem needs to be allocated
for every struct page.
3. It has minimal perf upside. The page_is_page_pool_iov() checks
currently have minimal perf impact, and I demonstrated that to Jesper
in RFC v2.
4. It also may not be technically feasible. I'm not sure how netmem
interacts with skb_frag_t. I guess we replace struct page* bv_page
with struct netmem* bv_page, and add changes there.
5. Drivers need to be refactored to use netmem* instead of page*,
unless we cast netmem* to page* before returning to the driver.

Possibly other downsides, these are what I could immediately think of.

If I'm still misunderstanding your suggestion, it may be time to send
me a concrete code snippet of what you have in mind. I'm a bit
confused at the moment because the only avenue I see to remove the if
statements in the page pool is to define the struct that we agreed is
not feasible in earlier emails.

> >
> >> So that 'struct page' for normal
> >> memory and 'struct page_pool_iov' for devmem share the common fields used
> >> by page pool and net stack?
> >
> > Are you suggesting that we'd cast a netmem* to a page* and call core
> > mm APIs on it? It's basically what was happening with RFC v1, where
> > things that are not struct pages were made to look like struct pages.
> >
> > Also, there isn't much upside for what you're suggesting, I think. For
> > example I can align the refcount variable in struct page_pool_iov with
> > the refcount in struct page so that this works:
> >
> > put_page((struct page*)ppiov);
> >
> > but it's a disaster. Because put_page() will call __put_page() if the
> > page is freed, and __put_page() will try to return the page to the
> > buddy allocator!
>
> As what I suggested above, Can we handle this in devmem provider's
> 'release_page' ops instead of calling put_page() directly as for devmem.
>
> >
> >> And we might be able to reuse the 'flags',
> >> '_pp_mapping_pad' and '_mapcount' for specific mem provider, which is enough
> >> for the devmem only requiring a single pointer to point to it's
> >> owner?
> >>
> >
> > All the above seems quite similar to RFC v1 again, using netmem
> > instead of struct page. In RFC v1 we re-used zone_device_data() for
> > the dma-buf owner equivalent.
>
> As we have added a few checkings to limit 'struct page' for devmem to
> be only used in net stack, we can decouple 'struct page' for devmem
> from mm subsystem, zone_device_data() is not really needed, right?
>
> If we can decouple 'struct page' for normal memory from mm subsystem
> through the folio work in the future, then we may define a more abstract
> structure for page pool and net stack instead of reusing 'struct page'
> from mm.
>
> >



--
Thanks,
Mina