Re: [PATCH v5] virtio/vsock: replace virtio_vsock_pkt with sk_buff

From: Bobby Eshleman
Date: Tue Dec 06 2022 - 13:07:11 EST


On Tue, Dec 06, 2022 at 11:20:21AM +0100, Paolo Abeni wrote:
> Hello,
>
> On Fri, 2022-12-02 at 09:35 -0800, Bobby Eshleman wrote:
> [...]
> > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > index 35d7eedb5e8e..6c0b2d4da3fe 100644
> > --- a/include/linux/virtio_vsock.h
> > +++ b/include/linux/virtio_vsock.h
> > @@ -3,10 +3,129 @@
> > #define _LINUX_VIRTIO_VSOCK_H
> >
> > #include <uapi/linux/virtio_vsock.h>
> > +#include <linux/bits.h>
> > #include <linux/socket.h>
> > #include <net/sock.h>
> > #include <net/af_vsock.h>
> >
> > +#define VIRTIO_VSOCK_SKB_HEADROOM (sizeof(struct virtio_vsock_hdr))
> > +
> > +enum virtio_vsock_skb_flags {
> > + VIRTIO_VSOCK_SKB_FLAGS_REPLY = BIT(0),
> > + VIRTIO_VSOCK_SKB_FLAGS_TAP_DELIVERED = BIT(1),
> > +};
> > +
> > +static inline struct virtio_vsock_hdr *virtio_vsock_hdr(struct sk_buff *skb)
> > +{
> > + return (struct virtio_vsock_hdr *)skb->head;
> > +}
> > +
> > +static inline bool virtio_vsock_skb_reply(struct sk_buff *skb)
> > +{
> > + return skb->_skb_refdst & VIRTIO_VSOCK_SKB_FLAGS_REPLY;
> > +}
>
> I'm sorry for the late feedback. The above is extremelly risky: if the
> skb will land later into the networking stack, we could experience the
> most difficult to track bugs.
>
> You should use the skb control buffer instead (skb->cb), with the
> additional benefit you could use e.g. bool - the compiler could emit
> better code to manipulate such fields - and you will not need to clear
> the field before release nor enqueue.
>
> [...]
>

Hey Paolo, thank you for the review. For my own learning, this would
happen presumably when the skb is dropped? And I assume we don't see
this in sockmap because it is always cleared before leaving sockmap's
hands? I sanity checked this patch with an out-of-tree patch I have that
uses the networking stack, but I suspect I didn't see issues because my
test harness didn't induce dropping...

I originally avoided skb->cb because the reply flag is set at allocation
and would potentially be clobbered by a pass through the networking
stack. The reply flag would be used after a pass through the networking
stack (e.g., during transmission at the device level and when sockets
close while skbs are still queued for xmit).

I suppose using skb->cb would look like something like this:
- use skb_clone() for reply skbs
- set reply on the cloned sk_buff skb->cb
- keep a hashmap mapping original skb to cloned skb (is there a better
way?)
- when choosing to apply reply logic, if skb_cloned() refer to the
hashmap

Is there a better/simpler way to maintain skb->cb?

> > @@ -352,37 +360,38 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
> > size_t len)
> > {
> > struct virtio_vsock_sock *vvs = vsk->trans;
> > - struct virtio_vsock_pkt *pkt;
> > size_t bytes, total = 0;
> > - u32 free_space;
> > + struct sk_buff *skb;
> > int err = -EFAULT;
> > + u32 free_space;
> >
> > spin_lock_bh(&vvs->rx_lock);
> > - while (total < len && !list_empty(&vvs->rx_queue)) {
> > - pkt = list_first_entry(&vvs->rx_queue,
> > - struct virtio_vsock_pkt, list);
> > + while (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> > + skb = __skb_dequeue(&vvs->rx_queue);
>
> Here the locking schema is confusing. It looks like vvs->rx_queue is
> under vvs->rx_lock protection, so the above should be skb_queue_empty()
> instead of the lockless variant.
>
> [...]
>
> > @@ -858,16 +873,11 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> > static void virtio_transport_remove_sock(struct vsock_sock *vsk)
> > {
> > struct virtio_vsock_sock *vvs = vsk->trans;
> > - struct virtio_vsock_pkt *pkt, *tmp;
> >
> > /* We don't need to take rx_lock, as the socket is closing and we are
> > * removing it.
> > */
> > - list_for_each_entry_safe(pkt, tmp, &vvs->rx_queue, list) {
> > - list_del(&pkt->list);
> > - virtio_transport_free_pkt(pkt);
> > - }
> > -
> > + virtio_vsock_skb_queue_purge(&vvs->rx_queue);
>
> Still assuming rx_queue is under the rx_lock, given you don't need the
> locking here as per the above comment, you should use the lockless
> purge variant.
>

Good catch, thanks!

Thanks again,
Bobby