Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

From: Stanislav Fomichev
Date: Tue Nov 07 2023 - 13:14:31 EST


On 11/07, Willem de Bruijn wrote:
> On Tue, Nov 7, 2023 at 12:44 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >
> > On 11/06, Willem de Bruijn wrote:
> > > > > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > > > > > it somehow implies that I have an option of passing or not passing it
> > > > > > > for an individual system call.
> > > > > > > If we know that we're going to use dmabuf with the socket, maybe we
> > > > > > > should move this flag to the socket() syscall?
> > > > > > >
> > > > > > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > > > > > >
> > > > > > > ?
> > > > > >
> > > > > > I think it should then be a setsockopt called before any data is
> > > > > > exchanged, with no change of modifying mode later. We generally use
> > > > > > setsockopts for the mode of a socket. This use of the protocol field
> > > > > > in socket() for setting a mode would be novel. Also, it might miss
> > > > > > passively opened connections, or be overly restrictive: one approach
> > > > > > for all accepted child sockets.
> > > > >
> > > > > I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> > > > > are plenty of bits we can grab. But setsockopt works as well!
> > > >
> > > > To follow up: if we have this flag on a socket, not on a per-message
> > > > basis, can we also use recvmsg for the recycling part maybe?
> > > >
> > > > while (true) {
> > > > memset(msg, 0, ...);
> > > >
> > > > /* receive the tokens */
> > > > ret = recvmsg(fd, &msg, 0);
> > > >
> > > > /* recycle the tokens from the above recvmsg() */
> > > > ret = recvmsg(fd, &msg, MSG_RECYCLE);
> > > > }
> > > >
> > > > recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
> > > > exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option
> > > > to recycle a range.
> > > >
> > > > Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
> > > > Or is it more confusing?
> > >
> > > It would have to be sendmsg, as recvmsg is a copy_to_user operation.
> > >
> > >
> > > I am not aware of any precedent in multiplexing the data stream and a
> > > control operation stream in this manner. It would also require adding
> > > a branch in the sendmsg hot path.
> >
> > Is it too much plumbing to copy_from_user msg_control deep in recvmsg
> > stack where we need it? Mixing in sendmsg is indeed ugly :-(
>
> I tried exactly the inverse of that when originally adding
> MSG_ZEROCOPY: to allow piggy-backing zerocopy completion notifications
> on sendmsg calls by writing to sendmsg msg_control on return to user.
> It required significant code churn, which the performance gains did
> not warrant. Doing so also breaks the simple rule that recv is for
> reading and send is for writing.

We're breaking so many rules here, so not sure we should be super
constrained :-D

> > Regarding hot patch: aren't we already doing copy_to_user for the tokens in
> > this hot path, so having one extra condition shouldn't hurt too much?
>
> We're doing that in the optional cmsg handling of recvmsg, which is
> already a slow path (compared to the data read() itself).
>
> > > The memory is associated with the socket, freed when the socket is
> > > closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket
> > > state operation, for which setsockopt is the socket interface.
> > >
> > > Is your request purely a dislike, or is there some technical concern
> > > with BPF and setsockopt?
> >
> > It's mostly because I've been bitten too much by custom socket options that
> > are not really on/off/update-value operations:
> >
> > 29ebbba7d461 - bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
> > 00e74ae08638 - bpf: Don't EFAULT for getsockopt with optval=NULL
> > 9cacf81f8161 - bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
> > d8fe449a9c51 - bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE
> >
> > I do agree that this particular case of SO_DEVMEM_DONTNEED seems ok, but
> > things tend to evolve and change.
>
> I see. I'm a bit concerned if we start limiting what we can do in
> sockets because of dependencies that BPF processing places on them.
> The use case for BPF [gs]etsockopt is limited to specific control mode
> calls. Would it make sense to just exclude calls like
> SO_DEVMEM_DONTNEED from this interpositioning?

Yup, that's why I'm asking. We already have ->bpf_bypass_getsockopt()
to special-case tcp zerocopy. We might add another bpf_bypass_setsockopt
to special case SO_DEVMEM_DONTNEED. That's why I'm trying to see if
there is a better alternative.

> At a high level what we really want is a high rate metadata path from
> user to kernel. And there are no perfect solutions. From kernel to
> user we use the socket error queue for this. That was never intended
> for high event rate itself, dealing with ICMP errors and the like
> before timestamps and zerocopy notifications were added.
>
> If I squint hard enough I can see some prior art in mixing data and
> high rate state changes within the same channel in NIC descriptor
> queues, where some devices do this, e.g., { "insert encryption key",
> "send packet" }. But fundamentally I think we should keep the socket
> queues for data only.

+1, we keep taking an easy route with using sockopt for this :-(

Anyway, let's see if any better suggestions pop up. Worst case - we stick
with a socket option and will add a bypass on the bpf side.