Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag)

From: Mina Almasry
Date: Wed Jul 05 2023 - 21:17:56 EST


On Mon, Jul 3, 2023 at 2:43 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Sun, Jul 02, 2023 at 11:22:33PM -0700, Mina Almasry wrote:
> > On Sun, Jul 2, 2023 at 9:20 PM David Ahern <dsahern@xxxxxxxxxx> wrote:
> > >
> > > On 6/29/23 8:27 PM, Mina Almasry wrote:
> > > >
> > > > Hello Jakub, I'm looking into device memory (peer-to-peer) networking
> > > > actually, and I plan to pursue using the page pool as a front end.
> > > >
> > > > Quick description of what I have so far:
> > > > current implementation uses device memory with struct pages; I am
> > > > putting all those pages in a gen_pool, and we have written an
> > > > allocator that allocates pages from the gen_pool. In the driver, we
> > > > use this allocator instead of alloc_page() (the driver in question is
> > > > gve which currently doesn't use the page pool). When the driver is
> > > > done with the p2p page, it simply decrements the refcount on it and
> > > > the page is freed back to the gen_pool.
> >
> > Quick update here, I was able to get my implementation working with
> > the page pool as a front end with the memory provider API Jakub wrote
> > here:
> > https://github.com/kuba-moo/linux/tree/pp-providers
> >
> > The main complication indeed was the fact that my device memory pages
> > are ZONE_DEVICE pages, which are incompatible with the page_pool due
> > to the union in struct page. I thought of a couple of approaches to
> > resolve that.
> >
> > 1. Make my device memory pages non-ZONE_DEVICE pages.
>
> Hard no on this from a mm perspective.. We need P2P memory to be
> properly tagged and have the expected struct pages to be DMA mappable
> and otherwise, you totally break everything if you try to do this..
>
> > 2. Convert the pages from ZONE_DEVICE pages to page_pool pages and
> > vice versa as they're being inserted and removed from the page pool.
>
> This is kind of scary, it is very, very, fragile to rework the pages
> like this. Eg what happens when the owning device unplugs and needs to
> revoke these pages? I think it would likely crash..
>
> I think it also technically breaks the DMA API as we may need to look
> into the pgmap to do cache ops on some architectures.
>
> I suggest you try to work with 8k folios and then the tail page's
> struct page is empty enough to store the information you need..

Hi Jason, sorry for the late reply,

I think this could work, and the page pool already supports order > 0
allocations. It may end up being a big change to the GVE driver, which
as I understand it currently deals with order 0 allocations exclusively.

Another issue is that in networks with a low MTU, we could be DMAing
only 1400/1500 bytes into each allocation, which leaves most of the
buffer unused if the allocation is 8K+. I would need to investigate a
bit to see if/how to solve that, and we may end up having to split the
page and again run into the 'not enough room in struct page' problem.
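
To put rough numbers on the low-MTU concern, here is a small userspace
sketch. It is only arithmetic, not driver code: the helper names are
made up for illustration, and the 2048-byte fragment size used in the
note below is an assumption, not something GVE or the page pool
mandates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, illustration only (not kernel or GVE code). */

/* Bytes left unused when one frame lands in a buffer of buf_len bytes. */
static size_t unused_bytes(size_t buf_len, size_t frame_len)
{
	assert(frame_len <= buf_len);
	return buf_len - frame_len;
}

/* Percentage of the buffer actually holding data, rounded down. */
static unsigned int used_percent(size_t buf_len, size_t frame_len)
{
	assert(buf_len > 0 && frame_len <= buf_len);
	return (unsigned int)(frame_len * 100 / buf_len);
}

/* How many fixed-size fragments one buffer yields if it is split. */
static size_t frags_per_buf(size_t buf_len, size_t frag_len)
{
	assert(frag_len > 0);
	return buf_len / frag_len;
}
```

With an 8192-byte allocation and a 1500-byte frame,
unused_bytes(8192, 1500) is 6692 and used_percent(8192, 1500) is 18.
Splitting the same allocation into assumed 2048-byte fragments gives
frags_per_buf(8192, 2048) == 4, but then each fragment needs its own
tracking state, which is where the struct page space problem comes
back.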

> Or allocate per page memory and do a memdesc like thing..
>

I need to review memdesc more closely. Do you imagine that I add a
pointer in struct page that points to the memdesc, or that I implement
a page to memdesc mapping in the page_pool? Either approach could work.
My concern would be that accessing the memdesc entries may incur a
cache miss, which would be unacceptable in fast paths, but I already
dereference page->pgmap in a few places and it doesn't seem to be an
issue.
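
To make the two options concrete, here is a rough userspace sketch of
the lookups I have in mind. All of the types and names below
(mock_memdesc, mock_page_a, mock_pool_b and the helpers) are
hypothetical stand-ins, not the actual memdesc design or page_pool
structures; the point is only to compare an embedded pointer with a
side table owned by the pool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page metadata kept outside the page struct. */
struct mock_memdesc {
	unsigned int pool_id;
	unsigned int frag_refs;
};

/* Option A: the page-like struct carries a pointer to its descriptor. */
struct mock_page_a {
	struct mock_memdesc *desc;
};

/* Option B: the pool owns a table indexed by (pfn - base_pfn). */
struct mock_pool_b {
	struct mock_memdesc *descs;
	size_t nr_pages;
	unsigned long base_pfn;
};

/* Option A lookup: a single dependent load through the page. */
static struct mock_memdesc *desc_from_page_a(struct mock_page_a *page)
{
	return page->desc;
}

/* Option B lookup: bounds check, then index into the pool's table. */
static struct mock_memdesc *desc_from_pfn_b(struct mock_pool_b *pool,
					    unsigned long pfn)
{
	if (pfn < pool->base_pfn || pfn - pool->base_pfn >= pool->nr_pages)
		return NULL;
	return &pool->descs[pfn - pool->base_pfn];
}
```

Either way the descriptor lives in separate memory from the page
itself, so both lookups touch an extra cache line, much like the
existing page->pgmap dereference does.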

> Though overall, you won't find devices creating struct pages for their
> P2P memory today, so I'm not sure what the purpose is. Jonathan
> already got highly slammed for proposing code to the kernel that was
> unusable. Please don't repeat that. Other than a special NVMe use case
> the interface for P2P is DMABUF right now and it is not struct page
> backed.
>

Our approach is actually to extend DMABUF to provide struct page
backed attachment mappings, which as far as I understand sidesteps the
issues Jonathan ran into. Our code is fully functional with any device
that supports dmabuf and in fact a lot of my tests use udmabuf to
minimize the dependencies. The RFC may come with a udmabuf selftest to
showcase that any dmabuf, even a mocked one, would be supported.

> Even if we did get to struct pages for device memory, it is highly
> likely cases you are interested in will be using larger than 4k
> folios, so page pool would need to cope with this nicely as well.
>

--
Thanks,
Mina