Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

From: Jakub Kicinski
Date: Thu Jun 22 2023 - 22:11:42 EST


On Thu, 22 Jun 2023 23:54:31 +0100 David Howells wrote:
> > Maybe it's just me but I'd prefer to keep the clear rule that splice
> > operates on pages not slab objects.
>
> sendpage isn't only being used for splice(). Or were you referring to
> splicing pages into socket buffers more generally?

Yes, sorry, any sort of "zero-copy attachment of data onto a socket
queue".

> > SIW is the software / fake implementation of RDMA, right? You couldn't have
> > picked a less important user :(
>
> ISCSI and sunrpc could both make use of this, as could ceph and others. I
> have patches for sunrpc to make it condense into a single bio_vec[] and
> sendmsg() in the server code (ie. nfsd) but for the moment, Chuck wanted me to
> just do the xdr payload.

But to be clear (and I'm not implying that it's not a strong enough
reason) - the only benefit from letting someone pass headers in a slab
object is that the code already uses kmalloc(), right? IOW it could be
changed to use frags without much of a LoC bloat?

> > Maybe we can get Eric to comment. The ability to identify "frag type"
> > seems cool indeed, but I haven't thought about using it to attach
> > slab objects.
>
> Unfortunately, you can't attach slab objects. Their lifetime isn't controlled
> by put_page() or folio_put(). kmalloc()/kfree() doesn't refcount them -
> they're recycled immediately. Hence why I was copying them. (Well, you
> could attach, but then you need a callback mechanism).

Right, right, I thought you were saying that _in the future_ we may try
to attach the slab objects as frags (and presumably copy when someone
tries to ref them). Maybe I over-interpreted.

> What I'm trying to do is make it so that the process of calling sock_sendmsg()
> with MSG_SPLICE_PAGES looks exactly the same as without: You fill in a
> bio_vec[] pointing to your protocol header, the payload and the trailer,
> pointing as appropriate to bits of slab, static, stack data or ref'able pages,
> and call sendmsg and then the data will get copied or spliced as appropriate
> to the page type, whether the MSG_SPLICE_PAGES flag is supplied and whether
> the flag is supported.
>
> There are a couple of things I'd like to avoid: (1) having to call
> sock_sendmsg() more than once per message and (2) having sendmsg allocate more
> space and make a copy of data that you had to copy into a frag before calling
> sendmsg.

If we're not planning to attach the slab objects as frags, then surely
doing kmalloc() + kfree() in the caller, and then allocating a frag and
copying the data over in the skb / socket code is also inefficient.
Fixing the caller gives all the benefits you want, and then some.

Granted some form of alloc_skb_frag() needs to be added so that callers
don't curse us, I'd start with something based on sk_page_frag().

Or we could pull the copying out into an intermediate helper which
first replaces all slab objects in the iovec with page frags and then
calls sock_sendmsg()? Maybe that's stupid...
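
To make the helper idea concrete, here is a rough userspace model, not
kernel code. Every name in it (seg, seg_kind, frag_pool, frag_alloc,
replace_slab_segs) is made up for illustration, and the bump allocator
only stands in for whatever an sk_page_frag()-based allocator would
really look like. It walks the vector, copies each slab-backed segment
into frag space, and leaves page-backed segments alone, so everything
that remains could be spliced by a single send call.

```c
/* Userspace model only: none of these names are kernel APIs. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tag saying where a segment's bytes live. */
enum seg_kind { SEG_SLAB, SEG_PAGE };

/* Hypothetical stand-in for one bio_vec-like entry. */
struct seg {
	enum seg_kind kind;
	const void *base;
	size_t len;
};

/* Stand-in for a per-socket page frag: a bump allocator over one buffer. */
struct frag_pool {
	unsigned char buf[4096];
	size_t used;
};

/* Carve len bytes out of the pool, or return NULL if it does not fit. */
static void *frag_alloc(struct frag_pool *pool, size_t len)
{
	void *p;

	if (len > sizeof(pool->buf) - pool->used)
		return NULL;
	p = pool->buf + pool->used;
	pool->used += len;
	return p;
}

/*
 * Copy every SEG_SLAB segment into the pool and retarget the entry at
 * the copy, so the caller may free the slab object straight away.
 * Returns the number of segments copied, or -1 if the pool ran out
 * (segments copied before the failure stay retargeted).
 */
static int replace_slab_segs(struct seg *v, size_t n, struct frag_pool *pool)
{
	int copied = 0;

	for (size_t i = 0; i < n; i++) {
		void *dst;

		if (v[i].kind != SEG_SLAB)
			continue;	/* page-backed data can be spliced as is */

		dst = frag_alloc(pool, v[i].len);
		if (!dst)
			return -1;

		memcpy(dst, v[i].base, v[i].len);
		v[i].base = dst;
		v[i].kind = SEG_PAGE;
		copied++;
	}
	return copied;
}
```

In this model the caller would run replace_slab_segs() over its vector
once and then make one send call with the result. The data is still
copied once, same as in the series; what changes is only where the copy
happens.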

Let's hear what others think. If we can't reach instant agreement --
can you strategically separate out the minimal set of changes required
to just kill MSG_SENDPAGE_NOTLAST? IMHO it's worth getting that into
6.5.