Re: [PATCH RFC net-next v2 3/4] vsock: Add lockless sendmsg() support

From: Stefano Garzarella
Date: Wed May 03 2023 - 08:10:40 EST


On Sat, Apr 15, 2023 at 05:29:12PM +0000, Bobby Eshleman wrote:
On Fri, Apr 28, 2023 at 12:29:50PM +0200, Stefano Garzarella wrote:
On Sat, Apr 15, 2023 at 10:30:55AM +0000, Bobby Eshleman wrote:
> On Wed, Apr 19, 2023 at 11:30:53AM +0200, Stefano Garzarella wrote:
> > On Fri, Apr 14, 2023 at 12:25:59AM +0000, Bobby Eshleman wrote:
> > > Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock
> > > it does not scale when many senders share a socket.
> > >
> > > Prior to this patch the socket lock is used to protect the local_addr,
> > > remote_addr, transport, and buffer size variables. What follows are the
> > > new protection schemes for the various protected fields that ensure a
> > > race-free multi-sender sendmsg() path for vsock dgrams.
> > >
> > > - local_addr
> > > local_addr changes as a result of binding a socket. The write path
> > > for local_addr is bind() and various vsock_auto_bind() call sites.
> > > After a socket has been bound via vsock_auto_bind() or bind(), subsequent
> > > calls to bind()/vsock_auto_bind() do not write to local_addr again. bind()
> > > rejects the user request and vsock_auto_bind() early exits.
> > > Therefore, the local addr can not change while a parallel thread is
> > > in sendmsg() and lock-free reads of local addr in sendmsg() are safe.
> > > Change: only acquire lock for auto-binding as-needed in sendmsg().
> > >
> > > - vsk->transport
> > > Updated upon socket creation and it doesn't change again until the
> >
> > This is true only for dgram, right?
> >
>
> Yes.
>
> > How do we decide which transport to assign for dgram?
> >
>
> The transport is assigned in proto->create() [vsock_create()]. It is
> assigned there *only* for dgrams, whereas for streams/seqpackets it is
> assigned in connect(). vsock_create() sets transport to
> 'transport_dgram' if sock->type == SOCK_DGRAM.
>
> vsock_sk_destruct() then eventually sets vsk->transport to NULL.
>
> Neither destruct nor create can occur in parallel with sendmsg().
> create() hasn't yet returned the sockfd for the user to call upon it
> sendmsg(), and AFAICT destruct is only called after the final socket
> reference is released, which only happens after the socket no longer
> exists in the fd lookup table and so sendmsg() will fail before it ever
> has the chance to race.

This is okay if a socket can be assigned to a single transport, but with
dgrams I'm not sure we can do that.

Since it is not connected, a socket can send or receive packets from
different transports, so maybe we can't assign it to a specific one,
but do a lookup for every packets to understand which transport to use.


Yes this is true, this lookup needs to be implemented... currently
sendmsg() doesn't do this at all. It grabs the remote_addr when passed
in from sendto(), but then just uses the same old transport from vsk.
You are right that sendto() should be a per-packet lookup, not a
vsk->transport read. Perhaps I should add that as another patch in this
series, and make it precede this one?

Yep, I think so, we need to implement it before adding a new transport
that can support dgram.


For the send() / sendto(NULL) case where vsk->transport is being read, I
do believe this is still race-free, but...

If we later support dynamic transports for datagram, such that
connect(VMADDR_LOCAL_CID) sets vsk->transport to transport_loopback,
connect(VMADDR_CID_HOST) sets vsk->transport to something like
transport_datagram_g2h, etc..., then vsk->transport will need to be
bundled into the RCU-protected pointer too, since it may change when
remote_addr changes.

Yep, I think so. Although in vsock_dgram_connect we call lock_sock(), so maybe that could be enough to protect us.

In general I think we should use vsk->transport if vsock_dgram_connect()
is called, or we need to do per-packet lookup.

Another think I would change, is the dgram_dequeue() callback.
I would remove it, and move in the core the code in vmci_transport_dgram_dequeue() since it seems pretty generic.

This should work well if every transport uses sk_receive_skb() to enqueue sk_buffs in the socket buffer.

Thanks,
Stefano