Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag)

From: Mina Almasry
Date: Mon Jul 03 2023 - 02:22:57 EST


On Sun, Jul 2, 2023 at 9:20 PM David Ahern <dsahern@xxxxxxxxxx> wrote:
>
> On 6/29/23 8:27 PM, Mina Almasry wrote:
> >
> > Hello Jakub, I'm looking into device memory (peer-to-peer) networking
> > actually, and I plan to pursue using the page pool as a front end.
> >
> > Quick description of what I have so far:
> > current implementation uses device memory with struct pages; I am
> > putting all those pages in a gen_pool, and we have written an
> > allocator that allocates pages from the gen_pool. In the driver, we
> > use this allocator instead of alloc_page() (the driver in question is
> > gve which currently doesn't use the page pool). When the driver is
> > done with the p2p page, it simply decrements the refcount on it and
> > the page is freed back to the gen_pool.
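
For context, roughly what that gen_pool-backed allocation looks like
(the helper names and structure here are illustrative, not the actual
gve changes; refcount handling is omitted):

#include <linux/genalloc.h>
#include <linux/io.h>
#include <linux/mm.h>

static struct gen_pool *dm_pool;

/* base/size describe the device memory range (physical addresses) for
 * which ZONE_DEVICE struct pages already exist.
 */
static int dm_pool_init(unsigned long base, size_t size, int nid)
{
	dm_pool = gen_pool_create(PAGE_SHIFT, nid);	/* PAGE_SIZE granularity */
	if (!dm_pool)
		return -ENOMEM;
	return gen_pool_add(dm_pool, base, size, nid);
}

/* Used in the driver rx path in place of alloc_page(). */
static struct page *dm_alloc_page(void)
{
	unsigned long paddr = gen_pool_alloc(dm_pool, PAGE_SIZE);

	if (!paddr)
		return NULL;
	return pfn_to_page(PHYS_PFN(paddr));
}

/* Called once the refcount drops and the driver is done with the page. */
static void dm_free_page(struct page *page)
{
	gen_pool_free(dm_pool, (unsigned long)page_to_phys(page), PAGE_SIZE);
}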

Quick update here: I was able to get my implementation working with the
page_pool as a front end, using the memory provider API Jakub wrote
here:
https://github.com/kuba-moo/linux/tree/pp-providers

The main complication was indeed that my device memory pages are
ZONE_DEVICE pages, which are incompatible with the page_pool because
the ZONE_DEVICE fields and the page_pool fields share the same union in
struct page, so a page can't carry both sets of state at once. I
thought of a couple of approaches to resolve that.
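
For reference, the overlap looks roughly like this (abridged and
paraphrased from the struct page definition, not a verbatim excerpt):

struct page {
	unsigned long flags;
	union {
		struct {	/* page_pool used by netstack */
			unsigned long pp_magic;
			struct page_pool *pp;
			unsigned long _pp_mapping_pad;
			unsigned long dma_addr;
			union {
				unsigned long dma_addr_upper;
				atomic_long_t pp_frag_count;
			};
		};
		struct {	/* ZONE_DEVICE pages */
			struct dev_pagemap *pgmap;
			void *zone_device_data;
			/* ... */
		};
		/* ... other users of the union ... */
	};
	/* ... */
};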

1. Make my device memory pages non-ZONE_DEVICE pages. The issue with
that is that if the page is not ZONE_DEVICE, put_page(page) will, I
think, attempt to free it to the buddy allocator, which is not correct.
The only places where the mm stack currently allows a custom freeing
callback (AFAIK) are ZONE_DEVICE pages, where free_zone_device_page()
calls the callback provided in page->pgmap->ops->page_free, and
compound pages, where compound_dtor is specified. My device memory
pages aren't compound pages, so only ZONE_DEVICE pages do what I want.
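
As a minimal sketch of what that ZONE_DEVICE hook looks like, reusing
the illustrative dm_free_page() helper from the sketch above (again,
not the actual code; the pgmap type/range setup depends on how the
device memory is mapped):

#include <linux/memremap.h>

static void dm_page_free(struct page *page)
{
	/* Called from free_zone_device_page() once the refcount drops;
	 * return the page to the gen_pool instead of the buddy
	 * allocator.
	 */
	dm_free_page(page);
}

static const struct dev_pagemap_ops dm_pgmap_ops = {
	.page_free = dm_page_free,
};

static struct dev_pagemap dm_pgmap = {
	.ops = &dm_pgmap_ops,
	/* .type, .range, .nr_range etc. omitted; memremap_pages() is
	 * what creates the ZONE_DEVICE struct pages for the range.
	 */
};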

2. Convert the pages from ZONE_DEVICE pages to page_pool pages and
vice versa as they are inserted into and removed from the page_pool.
This, I think, works elegantly without any issue, and is the option I
went with. The ZONE_DEVICE state I care about for device memory TCP is
page->zone_device_data, which holds the dma_addr, and page->pgmap,
which holds the page_free op. I'm able to store both in my memory
provider, so I can swap pages between ZONE_DEVICE and page_pool back
and forth.
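
Roughly, the conversion looks like the below; the provider struct and
helper names are illustrative (the real bookkeeping lives in my memory
provider), but it shows the idea of stashing and restoring the two
fields:

#include <linux/memremap.h>
#include <linux/mm.h>
#include <linux/xarray.h>

struct dm_provider {
	struct dev_pagemap *pgmap;	/* one pgmap covers all our pages */
	struct xarray dma_addrs;	/* pfn -> stashed zone_device_data,
					 * initialised with xa_init()
					 */
};

/* Before handing the page to the page_pool: save the ZONE_DEVICE state
 * so the page_pool can reuse those words of the struct page union.
 */
static int dm_page_to_pp(struct dm_provider *prov, struct page *page)
{
	void *old;

	prov->pgmap = page->pgmap;
	old = xa_store(&prov->dma_addrs, page_to_pfn(page),
		       page->zone_device_data, GFP_KERNEL);
	if (xa_is_err(old))
		return xa_err(old);

	/* The page_pool will now write pp_magic/pp/dma_addr here. */
	page->pgmap = NULL;
	page->zone_device_data = NULL;
	return 0;
}

/* When the page leaves the page_pool: restore the ZONE_DEVICE state so
 * the final put_page() goes through free_zone_device_page() and the
 * pgmap->ops->page_free() callback again.
 */
static void dm_page_to_zone_device(struct dm_provider *prov, struct page *page)
{
	page->pgmap = prov->pgmap;
	page->zone_device_data = xa_load(&prov->dma_addrs, page_to_pfn(page));
}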

So far I've needed pretty much no modifications to Jakub's memory
provider implementation, and my functional tests are passing. If there
are no major objections, I'll look into cleaning up the interface a bit
and propose it for merging. This is a prerequisite for device memory
TCP via the page_pool.

>
> I take it these are typical Linux networking applications using standard
> socket APIs (not dpdk or xdp sockets or such)? If so, what does tcpdump
> show for those skbs with pages for the device memory?
>

Yes, these are using (mostly) standard socket APIs. We have small
extensions to sendmsg() and recvmsg() to pass a reference to the device
memory in both of those cases, but that's about it.

tcpdump is able to access the headers of these skbs, which are in host
memory, but not the payload, which is in device memory. Here is an
example session with my netcat-like test for device memory TCP:
https://pastebin.com/raw/FRjKf0kv

tcpdump seems to work, and the packet lengths above are correct.
tcpdump -A, however, isn't able to print the packet payloads:
https://pastebin.com/raw/2PcNxaZV

--
Thanks,
Mina