Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

From: Mina Almasry
Date: Mon Nov 06 2023 - 19:20:28 EST


On Mon, Nov 6, 2023 at 4:08 PM Willem de Bruijn
<willemdebruijn.kernel@xxxxxxxxx> wrote:
>
> On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >
> > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <almasrymina@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> > > >
> > > > On 11/06, Mina Almasry wrote:
> > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@xxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > > > >> (accessible) in the same skb.
> > > > > > > > >>
> > > > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > > > >> are device memory frags or not.
> > > > > > > > >>
> > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > >>
> > > > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > > > >>
> > > > > > > > >> Signed-off-by: Willem de Bruijn <willemb@xxxxxxxxxx>
> > > > > > > > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@xxxxxxxxxx>
> > > > > > > > >> Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx>
> > > > > > > > >>
> > > > > > > > >> ---
> > > > > > > > >> include/linux/skbuff.h | 14 +++++++-
> > > > > > > > >> include/net/tcp.h | 5 +--
> > > > > > > > >> net/core/datagram.c | 6 ++++
> > > > > > > > >> net/core/gro.c | 5 ++-
> > > > > > > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > > > >> net/ipv4/tcp.c | 6 ++++
> > > > > > > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > > > > > > >> net/ipv4/tcp_output.c | 5 ++-
> > > > > > > > >> net/packet/af_packet.c | 4 +--
> > > > > > > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > >>
> > > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > > > > > > >> * the packet minus one that have been verified as
> > > > > > > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > >> + * device memory.
> > > > > > > > >> * @dst_pending_confirm: need to confirm neighbour
> > > > > > > > >> * @decrypted: Decrypted SKB
> > > > > > > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > >> __u8 csum_not_inet:1;
> > > > > > > > >> #endif
> > > > > > > > >> -
> > > > > > > > >> + __u8 devmem:1;
> > > > > > > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > > >> __u16 tc_index; /* traffic control index */
> > > > > > > > >> #endif
> > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > >> __skb_zcopy_downgrade_managed(skb);
> > > > > > > > >> }
> > > > > > > > >>
> > > > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > > >> +{
> > > > > > > > >> + return skb->devmem;
> > > > > > > > >
> > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > > > future reason).
> > > > > > > >
> > > > > > > > +1.
> > > > > > > >
> > > > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > > > all of the frags are in the same memory type.
> > > > > >
> > > > > > David: maybe there should be such a requirement (that they all are
> > > > > > unreadable)? Might be easier to support initially; we can relax later
> > > > > > on.
> > > > > >
> > > > >
> > > > > Currently devmem == not_readable, and the restriction is that all the
> > > > > frags in the same skb must be either all readable or all unreadable
> > > > > (all devmem or all non-devmem).
> > > > >
> > > > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > > > >
> > > > > > > + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > + * device memory.
> > > > > > >
> > > > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > > > userspace.
> > > > > > >
> > > > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > > > > correct. If a new type of unreadable skb is introduced to the stack,
> > > > > > > I imagine the stack would implement:
> > > > > > >
> > > > > > > 1. new header flag: skb->newmem
> > > > > > > 2.
> > > > > > >
> > > > > > > > static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > {
> > > > > > > return skb->devmem || skb->newmem;
> > > > > > > }
> > > > > > >
> > > > > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs as in this patch
> > > > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > > > >
> > > > > > You copy it to the userspace in a special way because your frags
> > > > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > > > just an optimization.
> > > > > >
> > > > > > For most of the core stack, it doesn't matter why your skb is not
> > > > > > readable. For a few places where it matters (recvmsg?), you can
> > > > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > > > >
> > > > >
> > > > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > > > something else. We can even assume not_readable == devmem because
> > > > > devmem is currently the only type of unreadable frag.
> > > > >
> > > > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > > > netlink or something).
> > > > >
> > > > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > > > any packets that land on that rx-queue are bound to that dma-buf,
> > > > > regardless of which socket that packet belongs to. So the association
> > > > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> > > >
> > > > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > > > Because otherwise, there is no way currently to tell, at recvmsg, which
> > > > dmabuf the received token belongs to.
> > > >
> > >
> > > Yes, but this 1 dma-buf to 1 socket association happens because the
> > > user binds the dma-buf to an rx-queue and configures flow steering of
> > > the socket to that rx-queue.
> >
> > It's still fixed and won't change during the socket lifetime, right?

Technically, no.

The user is free to modify or delete flow steering rules independently
of the lifetime of the socket. It's possible for the user to
reconfigure flow steering while the socket is receiving, and the
result will be packets switching from devmem to non-devmem. A
reasonably configured application would probably want to steer 1 flow
to 1 dma-buf and never change it, but this is not something we
enforce; the user orchestrates it. In theory someone could find a use
case for configuring and unconfiguring flow steering during a
connection.
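
To make the rx-queue to dma-buf association concrete, here is a rough
userspace model. It is not kernel code: every toy_ name is made up for
illustration, and it models a single flow and a 2-queue NIC only.
Whether a packet ends up as devmem depends only on which rx-queue the
steering rule currently points at, so re-steering mid-connection flips
the type of the packets that follow.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NO_DMABUF	(-1)
#define TOY_NUM_QUEUES	2

/* Hypothetical model of an rx-queue with an optional dma-buf binding. */
struct toy_rxq {
	int dmabuf_id;		/* bound dma-buf, or TOY_NO_DMABUF */
};

/* Hypothetical NIC: the queues plus the steering rule for one flow. */
struct toy_nic {
	struct toy_rxq rxq[TOY_NUM_QUEUES];
	int flow_to_queue;	/* which rx-queue the flow is steered to */
};

/* Hypothetical packet; 'devmem' plays the role of skb->devmem. */
struct toy_pkt {
	bool devmem;
	int dmabuf_id;
};

/* A packet inherits its memory type from the queue it lands on. */
static struct toy_pkt toy_rx_packet(const struct toy_nic *nic)
{
	assert(nic->flow_to_queue >= 0 && nic->flow_to_queue < TOY_NUM_QUEUES);

	const struct toy_rxq *q = &nic->rxq[nic->flow_to_queue];
	struct toy_pkt pkt = {
		.devmem = q->dmabuf_id != TOY_NO_DMABUF,
		.dmabuf_id = q->dmabuf_id,
	};
	return pkt;
}
```

With queue 1 bound to a dma-buf and queue 0 unbound, a flow steered to
queue 1 yields devmem packets, and changing flow_to_queue to 0 makes
the same flow yield non-devmem packets from then on.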

> > And the socket has to know this association; otherwise those tokens
> > are useless since they don't carry anything to identify the dmabuf.
> >
> > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > it somehow implies that I have an option of passing or not passing it
> > for an individual system call.

You do have the option of passing it or not passing it per system
call. The MSG_SOCK_DEVMEM flag says the application is willing to
receive devmem cmsgs - that's all. The application doesn't get to
decide whether it's actually going to receive a devmem cmsg or not,
because that's dictated by the type of skb that is present in the
receive queue. I should explain this in the commit message...
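
Roughly, the per-call semantics can be modelled like this. This is a
userspace sketch, not the code in the series: the toy_ names and the
flag value are invented for illustration, and returning an error when
a devmem skb meets a caller that did not pass the flag is my
assumption for the sketch, not something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for MSG_SOCK_DEVMEM; the value is made up. */
#define TOY_MSG_SOCK_DEVMEM	0x1

enum toy_result {
	TOY_DATA,		/* payload copied to the user buffer as usual */
	TOY_DEVMEM_CMSG,	/* payload described by a devmem cmsg */
	TOY_ERROR,		/* ASSUMPTION: unwilling caller, devmem skb */
};

/* Hypothetical skb; 'devmem' plays the role of skb->devmem. */
struct toy_skb {
	bool devmem;
};

static inline bool toy_skb_frags_not_readable(const struct toy_skb *skb)
{
	return skb->devmem;
}

/* What one receive call yields for the skb at the head of the queue. */
static enum toy_result toy_recvmsg(const struct toy_skb *skb, int flags)
{
	assert(skb != NULL);

	/* Readable skb: normal path, the flag makes no difference. */
	if (!toy_skb_frags_not_readable(skb))
		return TOY_DATA;

	/* Unreadable skb: only a willing caller can be given a cmsg. */
	if (flags & TOY_MSG_SOCK_DEVMEM)
		return TOY_DEVMEM_CMSG;

	return TOY_ERROR;
}
```

The point of the model is that the flag never selects the result on
its own: the same call with TOY_MSG_SOCK_DEVMEM set returns TOY_DATA
for a readable skb and TOY_DEVMEM_CMSG for a devmem one.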

> > If we know that we're going to use dmabuf with the socket, maybe we
> > should move this flag to the socket() syscall?
> >
> > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> >
> > ?
>
> I think it should then be a setsockopt called before any data is
> exchanged, with no chance of modifying the mode later. We generally use
> setsockopts for the mode of a socket. This use of the protocol field
> in socket() for setting a mode would be novel. Also, it might miss
> passively opened connections, or be overly restrictive: one approach
> for all accepted child sockets.

We can definitely move SOCK_DEVMEM to a setsockopt(). Seems more than
reasonable.
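
As a sketch of the "set the mode before any data is exchanged" rule,
something along these lines could work. This is a userspace model
only: the toy_ names are hypothetical, no option name or number has
been chosen, and returning -EBUSY for a late change is an assumption
made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-socket state for the devmem mode. */
struct toy_sock {
	bool devmem_mode;	/* mode selected via the (future) sockopt */
	bool data_exchanged;	/* set once any data has been sent/received */
};

/*
 * Model of the sockopt handler: the mode can be chosen freely until
 * data has been exchanged, after which it is frozen.
 * Returns 0 on success, -EBUSY otherwise (ASSUMPTION: errno choice).
 */
static int toy_setsockopt_devmem(struct toy_sock *sk, bool on)
{
	assert(sk != NULL);

	if (sk->data_exchanged)
		return -EBUSY;

	sk->devmem_mode = on;
	return 0;
}
```

A caller would enable the mode right after creating the socket; once
data_exchanged is set, later attempts fail and devmem_mode keeps its
earlier value.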

--
Thanks,
Mina