Re: [PATCH RFC 3/8] memory-provider: dmabuf devmem memory provider

From: Mina Almasry
Date: Tue Nov 14 2023 - 07:25:08 EST


On Tue, Nov 14, 2023 at 12:23 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
>
> +cc Christian, Jason and Willy
>
> On 2023/11/14 7:05, Jakub Kicinski wrote:
> > On Mon, 13 Nov 2023 05:42:16 -0800 Mina Almasry wrote:
> >> You're doing exactly what I think you're doing, and what was nacked in RFC v1.
> >>
> >> You've converted 'struct page_pool_iov' to essentially become a
> >> duplicate of 'struct page'. Then, you're casting page_pool_iov* into
> >> struct page* in mp_dmabuf_devmem_alloc_pages(), then, you're calling
> >> mm APIs like page_ref_*() on the page_pool_iov* because you've fooled
> >> the mm stack into thinking dma-buf memory is a struct page.
>
> Yes, something like above, but I am not sure about the 'fooled the mm
> stack into thinking dma-buf memory is a struct page' part, because:
> 1. We never let the 'struct page' for devmem leak out of the net stack,
> thanks to the 'not kmap()able and not readable' checking in your patchset.

RFC v1 never used dma-buf pages outside the net stack either, so that is
the same.

You are not able to get rid of the 'not kmap()able and not readable'
checking with this approach, because dma-buf memory is fundamentally
not kmap()able and not readable. This approach would still need
skb_frags_not_readable checks in the net stack, so that is also the same.

> 2. We initialize page->_refcount for devmem to one and it remains as one,
> we will never call page_ref_inc()/page_ref_dec()/get_page()/put_page(),
> instead, we use page pool's pp_frag_count to do reference counting for
> devmem page in patch 6.
>

I'm not sure that moves the needle in terms of allowing dma-buf
memory to look like struct pages.

> >>
> >> RFC v1 was almost exactly the same, except instead of creating a
> >> duplicate definition of struct page, it just allocated 'struct page'
> >> instead of allocating another struct that is identical to struct page
> >> and casting it into struct page.
>
> Perhaps it is more accurate to say this is something between RFC v1 and
> RFC v3, in order to decouple 'struct page' for devmem from mm subsystem,
> but still have most unified handling for both normal memory and devmem
> in page pool and net stack.
>
> The main difference between this patchset and RFC v1:
> 1. The mm subsystem is not supposed to see the 'struct page' for devmem
> in this patchset, I guess we could say it is decoupled from the mm
> subsystem even though we still call PageTail()/page_ref_count()/
> page_is_pfmemalloc() on 'struct page' for devmem.
>

In this patchset you pretty much allocate a struct page look-alike for
your dma-buf memory, and then cast it into a struct page, so all the mm
calls in page_pool.c are seeing a struct page when it's really dma-buf
memory.

'even though we still call
PageTail()/page_ref_count()/page_is_pfmemalloc() on 'struct page' for
devmem' is basically making dma-buf memory look like struct pages.

Actually, because you put the 'struct page for devmem' in
skb->bv_frag, the net stack will grab the 'struct page' for devmem
using skb_frag_page() and then call things like page_address(), kmap,
get_page, put_page, etc.
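
To make that concern concrete, here is a minimal userspace sketch. It is
not kernel code: every mock_* name is a hypothetical stand-in I made up
for illustration, and the devmem struct embeds a page-shaped header as its
first member (rather than mirroring the layout) only so that the cast is
well-defined C. The point is that once the pointer sits in a frag slot
typed as a page, the consumer cannot tell it apart from host memory, and
only a separate readability flag stops it from dereferencing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for 'struct page'; NOT the kernel definition. */
struct mock_page {
    void *host_addr;   /* what a page_address()-like helper would return */
    int refcount;
};

/* Hypothetical devmem descriptor that starts with a page-shaped header. */
struct mock_ppiov {
    struct mock_page as_page;   /* host_addr stays NULL: no CPU mapping */
    size_t dmabuf_offset;
};

/* Stand-in for a frag slot that can only carry a page pointer. */
struct mock_frag {
    struct mock_page *page;
};

/* Stand-in for page_address(): blindly trusts that it got a real page. */
static inline void *mock_page_address(const struct mock_page *page)
{
    return page->host_addr;
}

static inline struct mock_frag mock_frag_from_host(struct mock_page *page)
{
    struct mock_frag frag = { .page = page };
    return frag;
}

/* The cast under discussion: device memory handed out as a page. */
static inline struct mock_frag mock_frag_from_devmem(struct mock_ppiov *iov)
{
    struct mock_frag frag = { .page = (struct mock_page *)iov };
    return frag;
}

/*
 * Consumer side. Nothing in 'frag' says whether the memory is readable,
 * so the caller must pass an out-of-band flag (modelling the
 * skb_frags_not_readable style of check). Returns 0 on success, -1 if
 * the read is refused.
 */
static inline int mock_read_first_byte(const struct mock_frag *frag,
                                       bool frags_readable,
                                       unsigned char *out)
{
    if (!frags_readable)
        return -1;
    *out = *(const unsigned char *)mock_page_address(frag->page);
    return 0;
}
```

Dropping the frags_readable test in this sketch would make the devmem case
dereference a NULL address, which is the userspace analogue of the net
stack calling page_address() or kmap on dma-buf memory.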

> The main difference between this patchset and RFC v3:
> 1. It reuses the 'struct page' to have more unified handling between
> normal page and devmem page for net stack.

This is what was nacked in RFC v1.

> 2. It relies on the page->pp_frag_count to do reference counting.
>

I don't see you changing any of the page_ref_* calls in page_pool.c, for
example this one:

https://elixir.bootlin.com/linux/latest/source/net/core/page_pool.c#L601

So the reference the page_pool is seeing is actually page->_refcount,
not page->pp_frag_count? I'm confused here. Is this a bug in the
patchset?
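
To spell out why this confuses me, here is a small userspace sketch. It
assumes the linked check has the form page_ref_count(page) == 1 (the
'latest' link may drift, so treat that as an assumption), and all mock_*
names are hypothetical stand-ins, not kernel or patchset code. If the
provider pins the first counter at one and does all real counting in the
second, a check on the first counter passes no matter how many users
still hold the memory.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in carrying both counters discussed above. */
struct mock_page {
    int refcount;        /* models page->_refcount, pinned at 1 for devmem */
    long pp_frag_count;  /* models page->pp_frag_count, the real user count */
};

/* Models a recycle check of the assumed form page_ref_count(page) == 1. */
static inline bool mock_pool_can_recycle(const struct mock_page *page)
{
    return page->refcount == 1;
}

/* Provider-style reference taking: only pp_frag_count moves. */
static inline void mock_frag_get(struct mock_page *page)
{
    page->pp_frag_count++;
}

/* Returns true only when the last fragment user drops its reference. */
static inline bool mock_frag_put(struct mock_page *page)
{
    assert(page->pp_frag_count > 0);
    return --page->pp_frag_count == 0;
}
```

With three outstanding users, mock_pool_can_recycle() still returns true
in this sketch, so a check on the pinned counter carries no information
unless the pp_frag_count path gates it first.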

> >>
> >> I don't think what you're doing here reverses the nacks I got in RFC
> >> v1. You also did not CC any dma-buf or mm people on this proposal that
> >> would bring up these concerns again.
> >
> > Right, but the mirror struct has some appeal to a non-mm person like
> > myself. The problem IIUC is that this patch is the wrong way around, we
> > should be converting everyone who can deal with non-host mem to struct
> > page_pool_iov. Using page_address() on ppiov which hns3 seems to do in
> > this series does not compute for me.
>
> The hacky use of ppiov in hns3 is only there to do some prototype
> testing, so ignore it.
>
> >
> > Then we can turn the existing non-iov helpers to be a thin wrapper with
> > just a cast from struct page to struct page_pool_iov, and a call of the
> > iov helper. Again - never cast the other way around.
>
> I am agreed that a cast from struct page to struct page_pool_iov is allowed,
> but a cast from struct page_pool_iov to struct page is not allowed if I am
> understanding you correctly.
>
> Before we can also completely decouple 'struct page' allocated using buddy
> allocator directly from mm subsystem in netstack, below is what I have in
> mind in order to support different memory provider.
>
>                                  +--------------+
>                                  |   Netstack   |
>                                  |'struct page' |
>                                  +--------------+
>                                          ^
>                                          |
>                                          |
>                                          v
>                               +---------------------+
> +----------------------+      |                     |      +---------------+
> |      devmem MP       |<---->|      Page pool      |----->|    **** MP    |
> |'struct page_pool_iov'|      |    'struct page'    |      |'struct **_iov'|
> +----------------------+      |                     |      +---------------+
>                               +---------------------+
>                                          ^
>                                          |
>                                          |
>                                          v
>                                  +---------------+
>                                  |    Driver     |
>                                  | 'struct page' |
>                                  +---------------+
>
> I would expect net stack, page pool, driver still see the 'struct page',
> only memory provider see the specific struct for itself, for the above,
> devmem memory provider sees the 'struct page_pool_iov'.
>
> The reason I still expect driver to see the 'struct page' is that driver
> will still need to support normal memory besides devmem.
>
> >
> > Also I think this conversion can be done completely separately from the
> > mem provider changes. Just add struct page_pool_iov and start using it.
>
> I am not sure I understand what does "Just add struct page_pool_iov and
> start using it" mean yet.
>
> >
> > Does that make more sense?
> >
> > .
> >



--
Thanks,
Mina