Re: [RFC PATCH net-next v5 2/2] net: add netmem to skb_frag_t

From: Mina Almasry
Date: Thu Jan 18 2024 - 08:57:16 EST


On Wed, Jan 17, 2024 at 10:54 AM Willem de Bruijn
<willemdebruijn.kernel@xxxxxxxxx> wrote:
>
> Mina Almasry wrote:
> > On Wed, Jan 17, 2024 at 10:00 AM Mina Almasry <almasrymina@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Jan 16, 2024 at 4:16 AM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> > > > > On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > > > > > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > > > > >>>> You did not answer my question that I asked here, and ignoring this
> > > > > >>>> question is preventing us from making any forward progress on this
> > > > > >>>> discussion. What do you expect or want skb_frag_page() to do when
> > > > > >>>> there is no page in the frag?
> > > > > >>>
> > > > > >>> I would expect it to do nothing.
> > > > > >>
> > > > > >> I don't understand. skb_frag_page() with an empty implementation just
> > > > > >> results in a compiler error as the function needs to return a page
> > > > > >> pointer. Do you actually expect skb_frag_page() to unconditionally
> > > > > >> cast frag->netmem to a page pointer? That was explained as
> > > > > >> unacceptable over and over again by Jason and Christian as it risks
> > > > > >> casting devmem to page; completely unacceptable and will get nacked.
> > > > > >> Do you have a suggestion of what skb_frag_page() should do that will
> > > > > >> not get nacked by mm?
> > > > > >
> > > > > > WARN_ON and return NULL seems reasonable?
> > > > >
> > >
> > > That's more or less what I'm thinking.
> > >
> > > > > While I am agreed that it may be a nightmare to debug the case of passing
> > > > > a false page into the mm system, but I am not sure what's the point of
> > > > > returning NULL to caller if the caller is not expecting or handling
> > > > > the
> > > >
> > > > You have to return something and NULL will largely reliably crash the
> > > > thread. The WARN_ON explains in detail why your thread just crashed.
> > > >
> > >
> > > Agreed.
> > >
> > > > > NULL returning[for example, most of mm API called by the networking does not
> > > > > seems to handling NULL as input page], isn't the NULL returning will make
> > > > > the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> > > > > depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> > > > > As returning NULL seems to be causing a confusion for the caller of
> > > > > skb_frag_page() as whether to or how to handle the NULL returning case.
> > > >
> > > > Possibly, though Linus doesn't like BUG_ON on principle..
> > > >
> > > > I think the bigger challenge is convincing people that this devmem
> > > > stuff doesn't just open a bunch of holes in the kernel where userspace
> > > > can crash it.
> > > >
> > >
> > > It does not, and as of right now there are no pending concerns from
> > > any netdev maintainers regarding mishandled devmem checks at least.
> > > This is because the devmem series comes with a full audit of
> > > skb_frag_page() callers [1] and all areas in the net stack attempting
> > > to access the skb [2].
> > >
> > > [1] https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-10-almasrymina@xxxxxxxxxx/
> > > [2] https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-11-almasrymina@xxxxxxxxxx/
> > >
> > > > The fact you all are debating what to do with skb_frag_page() suggests
> > > > to me there isn't confidence...
> > > >
> > >
> > > The debate raging on is related to the performance of skb_frag_page(),
> > > not correctness (and even then, I don't think it's related to
> > > perf...). Yunsheng would like us to optimize skb_frag_page() using an
> > > unconditional cast from netmem to page. This in Yunsheng's mind is a
> > > performance optimization as we don't need to add an if statement
> > > checking if the netmem is a page. I'm resistant to implement that
> > > change so far because:
> > >
> > > (a) unconditionally casting from netmem to page negates the compiler
> > > type safety that you and Christian are laying out as a requirement for
> > > the devmem stuff.
> > > (b) With likely/unlikely or static branches the check to make sure
> > > netmem is page is a no-op for existing use cases anyway, so AFAIU,
> > > there is no perf gain from optimizing it out anyway.
> > >
> >
> > Another thought, if anyone else is concerned about the devmem checks
> > performance, potentially we could introduce CONFIG_NET_DEVMEM which
> > when disabled prevents the user from using devmem at all (disables the
> > netlink API).
> >
> > When that is disabled, skb_frag_page(), netmem_to_page() & friends can
> > assume netmem is always a page and do a straight cast between netmem &
> > page. When it's enabled, it will check that netmem == page before
> > doing a cast, and return NULL if it is not a page.
> >
> > I think this is technically viable and I think preserves the compiler
> > type safety requirements set by mm folks. From my POV though, bloating
> > the kernel with a a new CONFIG just to optimize out no-op checks seems
> > unnecessary, but if there is agreement that the checks are a concern,
> > adding CONFIG_NET_DEVMEM should address it while being acceptable to
> > mm maintainers.
>
> I agree. A concern with CONFIGs is that what matters in practice is
> which default the distros compile with. In this case, adding hurdles
> to using the feature likely for no real reason.
>
> Static branches are used throughout the kernel in performance
> sensitive paths, exactly because they allow optional paths effectively
> for free. I'm quite surprised that this issue is being raised so
> strongly here, as they are hardly new or controversial.
>
> But perhaps best is to show with data. Is there a representative page
> pool benchmark that exercises the most sensitive case (XDP_DROP?) that
> we can run with and without a static branch to demonstrate that any
> diff is within the noise?
>

Yes, Jesper has a page_pool benchmark that he pointed me to in RFC v2:

https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

When running the results on RFCv2, the results were:

net-next @ b44693495af8
https://pastebin.com/raw/JuU7UQXe

+ devmem TCP changes:
https://pastebin.com/raw/mY1L6U4r

On RFC v2 the benchmark showed only a single instruction regression in
the page_pool fast path & the change deemed acceptable to Jesper from
a performance POV [1].

I have not run the benchmark continually on follow up iterations of
the RFC, but I think I'll start doing so in the future.

[1] https://lore.kernel.org/netdev/7aedc5d5-0daf-63be-21bc-3b724cc1cab9@xxxxxxxxxx/

> > > But none of this is related to correctness. Code calling
> > > skb_frag_page() will fail or crash if it's not handled correctly
> > > regardless of the implementation details of skb_frag_page(). In the
> > > devmem series we add support to handle it correctly via [1] & [2].
> > >
> > > --
> > > Thanks,
> > > Mina
> >
> >
> >
> > --
> > Thanks,
> > Mina
>
>


--
Thanks,
Mina