Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

From: Mina Almasry
Date: Mon Nov 06 2023 - 17:19:09 EST


On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
>
> On 11/06, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@xxxxxxxxxx> wrote:
> > >
> > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > On 11/05, Mina Almasry wrote:
> > > >> For device memory TCP, we expect the skb headers to be available in host
> > > >> memory for access, and we expect the skb frags to be in device memory
> > > >> and unaccessible to the host. We expect there to be no mixing and
> > > >> matching of device memory frags (unaccessible) with host memory frags
> > > >> (accessible) in the same skb.
> > > >>
> > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > >> are device memory frags or not.
> > > >>
> > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > >> and marks the skb as skb->devmem accordingly.
> > > >>
> > > >> Add checks through the network stack to avoid accessing the frags of
> > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > >>
> > > >> Signed-off-by: Willem de Bruijn <willemb@xxxxxxxxxx>
> > > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@xxxxxxxxxx>
> > > >> Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx>
> > > >>
> > > >> ---
> > > >> include/linux/skbuff.h | 14 +++++++-
> > > >> include/net/tcp.h | 5 +--
> > > >> net/core/datagram.c | 6 ++++
> > > >> net/core/gro.c | 5 ++-
> > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > >> net/ipv4/tcp.c | 6 ++++
> > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > >> net/ipv4/tcp_output.c | 5 ++-
> > > >> net/packet/af_packet.c | 4 +--
> > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > >>
> > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > >> index 1fae276c1353..8fb468ff8115 100644
> > > >> --- a/include/linux/skbuff.h
> > > >> +++ b/include/linux/skbuff.h
> > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > >> * the packet minus one that have been verified as
> > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > >> + * device memory.
> > > >> * @dst_pending_confirm: need to confirm neighbour
> > > >> * @decrypted: Decrypted SKB
> > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > >> __u8 csum_not_inet:1;
> > > >> #endif
> > > >> -
> > > >> + __u8 devmem:1;
> > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > >> __u16 tc_index; /* traffic control index */
> > > >> #endif
> > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > >> __skb_zcopy_downgrade_managed(skb);
> > > >> }
> > > >>
> > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > >> +{
> > > >> + return skb->devmem;
> > > >
> > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > It better communicates the fact that the stack shouldn't dereference the
> > > > frags (because it has 'devmem' fragments or for some other potential
> > > > future reason).
> > >
> > > +1.
> > >
> > > Also, the flag on the skb is an optimization - a high level signal that
> > > one or more frags is in unreadable memory. There is no requirement that
> > > all of the frags are in the same memory type.
>
> David: maybe there should be such a requirement (that they all are
> unreadable)? Might be easier to support initially; we can relax later
> on.
>

Currently devmem == not_readable, and the restriction is that all the
frags in the same skb must be either all readable or all unreadable
(all devmem or all non-devmem).
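
To illustrate the restriction, here is a minimal userspace sketch. It is
not code from this series: struct mock_skb, enum mock_frag_kind and
mock_skb_add_frag() are hypothetical names invented for the example, and
the real check lives around __skb_fill_page_desc(). The only point is
that a frag of the other kind is refused once the skb already holds
frags.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_MAX_FRAGS 17	/* arbitrary bound for the mock-up */

/* Hypothetical stand-in for "is this frag a page_pool_iov or a page". */
enum mock_frag_kind { MOCK_FRAG_HOST, MOCK_FRAG_DEVMEM };

/* Hypothetical, heavily reduced stand-in for struct sk_buff. */
struct mock_skb {
	unsigned int nr_frags;
	bool devmem;	/* meaningful once nr_frags > 0 */
	enum mock_frag_kind frags[MOCK_MAX_FRAGS];
};

/*
 * Append a frag. Returns 0 on success, or -1 if the new frag would mix
 * readable (host) and unreadable (devmem) memory in the same skb.
 */
static int mock_skb_add_frag(struct mock_skb *skb, enum mock_frag_kind kind)
{
	bool is_devmem = (kind == MOCK_FRAG_DEVMEM);

	assert(skb->nr_frags < MOCK_MAX_FRAGS);

	/* The first frag decides the type; later frags must match it. */
	if (skb->nr_frags > 0 && skb->devmem != is_devmem)
		return -1;

	skb->devmem = is_devmem;
	skb->frags[skb->nr_frags++] = kind;
	return 0;
}
```

With this shape, a single bit on the skb is enough to describe every
frag in it, which is what lets the rest of the stack test one flag
instead of walking the frags.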

> > The flag indicates that the skb contains all devmem dma-buf memory
> > specifically, not generic 'not_readable' frags as the comment says:
> >
> > + * @devmem: indicates that all the fragments in this skb are backed by
> > + * device memory.
> >
> > The reason it's not a generic 'not_readable' flag is because handing
> > off a generic not_readable skb to the userspace is semantically not
> > what we're doing. recvmsg() is augmented in this patch series to
> > return a devmem skb to the user via a cmsg_devmem struct which refers
> > specifically to the memory in the dma-buf. recvmsg() in this patch
> > series is not augmented to give any 'not_readable' skb to the
> > userspace.
> >
> > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > correct. If a new type of unreadable skb is introduced to the stack,
> > I imagine the stack would implement:
> >
> > 1. new header flag: skb->newmem
> > 2.
> >
> > static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > {
> > return skb->devmem || skb->newmem;
> > }
> >
> > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs as in this patch
> > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
>
> You copy it to the userspace in a special way because your frags
> are page_is_page_pool_iov(). I agree with David, the skb bit is
> just an optimization.
>
> For most of the core stack, it doesn't matter why your skb is not
> readable. For a few places where it matters (recvmsg?), you can
> double-check your frags (all or some) with page_is_page_pool_iov.
>

I see, we can do that then. I.e. make the header flag 'not_readable'
and check the frags to decide whether to delegate to tcp_recvmsg_devmem()
or something else. We can even assume not_readable == devmem, because
devmem is currently the only type of unreadable frag.
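
Roughly, the generic flag plus a per-frag check at recvmsg time could
look like the userspace sketch below. Everything in it is hypothetical
and made up for illustration (struct sketch_skb, enum sketch_recv_path,
sketch_pick_recv_path()); in the kernel the per-frag test would be
something like page_is_page_pool_iov(), and the devmem path would be
tcp_recvmsg_devmem().

```c
#include <stdbool.h>

#define SKETCH_MAX_FRAGS 17	/* arbitrary bound for the mock-up */

/* Hypothetical stand-in for the per-frag "is this a page_pool_iov" test. */
enum sketch_frag_kind { SKETCH_FRAG_HOST, SKETCH_FRAG_DEVMEM };

/* Hypothetical, heavily reduced stand-in for struct sk_buff. */
struct sketch_skb {
	bool not_readable;	/* generic hint: do not touch the frags */
	unsigned int nr_frags;
	enum sketch_frag_kind frags[SKETCH_MAX_FRAGS];
};

enum sketch_recv_path {
	SKETCH_RECV_COPY,	/* ordinary copy to userspace */
	SKETCH_RECV_DEVMEM,	/* hand back dma-buf references */
	SKETCH_RECV_UNSUPPORTED	/* unreadable, but not devmem */
};

/*
 * Most of the stack only looks at not_readable. Only the receive path
 * cares why the skb is unreadable, so only here are the frags inspected.
 * An unreadable skb with no frags is treated as devmem in this sketch.
 */
static enum sketch_recv_path sketch_pick_recv_path(const struct sketch_skb *skb)
{
	unsigned int i;

	if (!skb->not_readable)
		return SKETCH_RECV_COPY;

	for (i = 0; i < skb->nr_frags; i++)
		if (skb->frags[i] != SKETCH_FRAG_DEVMEM)
			return SKETCH_RECV_UNSUPPORTED;

	return SKETCH_RECV_DEVMEM;
}
```

As long as devmem is the only unreadable frag type, the loop could be
skipped and not_readable treated as devmem directly.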

> Unrelated: we probably need socket to dmabuf association as well (via
> netlink or something).

Not sure this is possible. The dma-buf is bound to the rx-queue, and
any packets that land on that rx-queue are bound to that dma-buf,
regardless of which socket that packet belongs to. So the association
IMO must be rx-queue to dma-buf, not socket to dma-buf.
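
A toy userspace model of that association, purely for illustration: the
names (struct sketch_netdev, sketch_bind_rxq(), sketch_pkt_dmabuf()) are
hypothetical and do not correspond to the binding API in this series.
The lookup key is the rx-queue the packet landed on; the socket does not
take part in it.

```c
#include <stddef.h>

#define SKETCH_NUM_RXQ 4	/* arbitrary queue count for the mock-up */

/* Hypothetical stand-in for a dma-buf. */
struct sketch_dmabuf {
	int id;
};

/* Hypothetical device: at most one dma-buf bound per rx-queue. */
struct sketch_netdev {
	const struct sketch_dmabuf *rxq_binding[SKETCH_NUM_RXQ];
};

/* Hypothetical packet: knows its rx-queue and its destination socket. */
struct sketch_pkt {
	unsigned int rxq;
	int sock_id;
};

/* Bind buf to rxq. Returns 0 on success, -1 if rxq is invalid or taken. */
static int sketch_bind_rxq(struct sketch_netdev *dev, unsigned int rxq,
			   const struct sketch_dmabuf *buf)
{
	if (rxq >= SKETCH_NUM_RXQ || dev->rxq_binding[rxq])
		return -1;

	dev->rxq_binding[rxq] = buf;
	return 0;
}

/*
 * The dma-buf backing a packet depends only on pkt->rxq; pkt->sock_id
 * is deliberately unused. Returns NULL for an unbound or invalid queue.
 */
static const struct sketch_dmabuf *
sketch_pkt_dmabuf(const struct sketch_netdev *dev, const struct sketch_pkt *pkt)
{
	if (pkt->rxq >= SKETCH_NUM_RXQ)
		return NULL;

	return dev->rxq_binding[pkt->rxq];
}
```

Two packets for different sockets that land on the same bound queue
resolve to the same dma-buf in this model.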

> We are fundamentally receiving into and sending from a dmabuf (devmem ==
> dmabuf).
> And once you have this association, recvmsg shouldn't need any new
> special flags.


--
Thanks,
Mina