Re: [PATCH v16 net-next 1/1] hv_sock: introduce Hyper-V Sockets

From: Michal Kubecek
Date: Wed Jul 13 2016 - 03:20:37 EST


On Mon, Jul 11, 2016 at 02:56:52PM +0000, Dexuan Cui wrote:
> Hyper-V Sockets (hv_sock) supplies a byte-stream-based communication
> mechanism between the host and the guest. It's somewhat like TCP over
> VMBus, but the transport layer (VMBus) is much simpler than IP.
>
> With Hyper-V Sockets, applications on the host and in the guest can talk
> to each other directly using the traditional BSD-style socket APIs.
>
> Hyper-V Sockets is only available on newer Windows hosts, such as Windows
> Server 2016. More info is in the article "Make your own integration services":
> https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service
>
> The patch implements the necessary support on the guest side by introducing
> a new socket address family, AF_HYPERV.
>
> Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx>
> Cc: "K. Y. Srinivasan" <kys@xxxxxxxxxxxxx>
> Cc: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>
> Cc: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> Cc: Cathy Avery <cavery@xxxxxxxxxx>
> Cc: Olaf Hering <olaf@xxxxxxxxx>
> ---
>
> You can also get the patch by (commit 5dde7975):
> https://github.com/dcui/linux/tree/decui/hv_sock/net-next/20160711_v16
>
>
> For the change log before v12, please see https://lkml.org/lkml/2016/5/15/31
>
> In v12, the changes are mainly the following:
>
> 1) remove the module params as David suggested.
>
> 2) use exactly 5 pages for the VMBus send and recv rings, respectively.
> The host-side design of the feature requires exactly 5 pages for the
> recv/send rings respectively -- this is suboptimal in terms of memory
> consumption, but unfortunately we have to live with it until the host
> comes up with a new design in the future. :-(
>
> 3) remove the per-connection static send/recv buffers.
> Instead, we allocate and free the buffers dynamically, and only when we
> recv/send data. This means that when a connection is idle, no memory is
> consumed for recv/send buffers at all.
>
> In v13:
> I return ENOMEM on buffer allocation failure.
>
> Actually "man read/write" says "Other errors may occur, depending on the
> object connected to fd". "man send/recv" indeed lists ENOMEM.
> Considering AF_HYPERV is a new socket type, ENOMEM seems OK here.
> In the long run, I think we should add a new API in the VMBus driver,
> allowing data to be copied from the VMBus ring buffer directly into a
> user-mode buffer. That way, we could eliminate this temporary buffer entirely.
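
Such an API could look roughly like the declaration below; the name and
signature are purely hypothetical and nothing like it exists in the VMBus
driver today -- it is only meant to illustrate copying the payload from
the inbound ring buffer straight into the caller's iov_iter:

/* hypothetical helper: copy the next packet's payload from the channel's
 * inbound ring buffer directly into a user-supplied iov_iter, skipping
 * the intermediate kernel bounce buffer
 */
int vmbus_recvpacket_to_iter(struct vmbus_channel *channel,
                             struct iov_iter *to, size_t len,
                             u32 *bytes_copied);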
>
> In v14:
> fix some coding style issues pointed out by David.
>
> In v15:
> Just some stylistic changes addressing comments from Joe Perches and
> Olaf Hering -- thank you!
> - add a GPL blurb.
> - define a new macro PAGE_SIZE_4K and use it to replace PAGE_SIZE
> - change sk_to_hvsock/hvsock_to_sk() from macros to inline functions
> - remove a not-very-useful pr_err()
> - fix some typos in comments and some coding style issues.
>
> In v16:
> Made stylistic changes addressing comments from Vitaly Kuznetsov.
> Thank you very much for the detailed comments, Vitaly!
>
> Looking forward to your comments!
>
> MAINTAINERS                 |    2 +
> include/linux/hyperv.h      |   13 +
> include/linux/socket.h      |    4 +-
> include/net/af_hvsock.h     |   78 +++
> include/uapi/linux/hyperv.h |   23 +
> net/Kconfig                 |    1 +
> net/Makefile                |    1 +
> net/hv_sock/Kconfig         |   10 +
> net/hv_sock/Makefile        |    3 +
> net/hv_sock/af_hvsock.c     | 1509 +++++++++++++++++++++++++++++++++++++++++++
> 10 files changed, 1643 insertions(+), 1 deletion(-)

...

> +static LIST_HEAD(hvsock_bound_list);
> +static LIST_HEAD(hvsock_connected_list);
> +static DEFINE_MUTEX(hvsock_mutex);
> +
> +static struct sock *hvsock_find_bound_socket(const struct sockaddr_hv *addr)
> +{
> +        struct hvsock_sock *hvsk;
> +
> +        list_for_each_entry(hvsk, &hvsock_bound_list, bound_list) {
> +                if (!uuid_le_cmp(addr->shv_service_guid,
> +                                 hvsk->local_addr.shv_service_guid))
> +                        return hvsock_to_sk(hvsk);
> +        }
> +        return NULL;
> +}
> +
> +static struct sock *hvsock_find_connected_socket_by_channel(
> +        const struct vmbus_channel *channel)
> +{
> +        struct hvsock_sock *hvsk;
> +
> +        list_for_each_entry(hvsk, &hvsock_connected_list, connected_list) {
> +                if (hvsk->channel == channel)
> +                        return hvsock_to_sk(hvsk);
> +        }
> +        return NULL;
> +}

How does this work from a performance point of view if there are many
connected sockets and/or a high frequency of new connections? AFAICS most
other families use a hash table for socket lookup.
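
Something along these lines is roughly what I would have expected -- a
minimal sketch loosely modeled on net/vmw_vsock/af_vsock.c; the bucket
count, the jhash-based hash and the table names are only illustrative
assumptions on my part, not something taken from the patch:

#include <linux/jhash.h>

#define HVSOCK_HASH_SIZE        251

/* each bucket would be INIT_LIST_HEAD()'d once at module init time */
static struct list_head hvsock_bound_table[HVSOCK_HASH_SIZE];
static struct list_head hvsock_connected_table[HVSOCK_HASH_SIZE];

static unsigned int hvsock_hash(const uuid_le *service_guid)
{
        /* any stable mix of the GUID bytes would do */
        return jhash(service_guid, sizeof(*service_guid), 0) %
               HVSOCK_HASH_SIZE;
}

static struct sock *hvsock_find_bound_socket(const struct sockaddr_hv *addr)
{
        unsigned int bucket = hvsock_hash(&addr->shv_service_guid);
        struct hvsock_sock *hvsk;

        list_for_each_entry(hvsk, &hvsock_bound_table[bucket], bound_list) {
                if (!uuid_le_cmp(addr->shv_service_guid,
                                 hvsk->local_addr.shv_service_guid))
                        return hvsock_to_sk(hvsk);
        }
        return NULL;
}

The connected-socket lookup keyed by the channel pointer could hash the
pointer the same way, so both lookups stay O(1) on average instead of
scanning a global list under hvsock_mutex.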

> +static void get_ringbuffer_rw_status(struct vmbus_channel *channel,
> +                                     bool *can_read, bool *can_write)
> +{
> +        u32 avl_read_bytes, avl_write_bytes, dummy;
> +
> +        if (can_read) {
> +                hv_get_ringbuffer_availbytes(&channel->inbound,
> +                                             &avl_read_bytes,
> +                                             &dummy);
> +                /* 0-size payload means FIN */
> +                *can_read = avl_read_bytes >= HVSOCK_PKT_LEN(0);
> +        }
> +
> +        if (can_write) {
> +                hv_get_ringbuffer_availbytes(&channel->outbound,
> +                                             &dummy,
> +                                             &avl_write_bytes);
> +
> +                /* We only write if there is enough space */
> +                *can_write = avl_write_bytes > HVSOCK_PKT_LEN(PAGE_SIZE);

I'm not sure where this comes from, but is this really supposed to be
PAGE_SIZE (and not the fixed 4KB PAGE_SIZE_4K)?
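
If the threshold is meant to reflect the fixed 4KB-based ring layout rather
than the architecture's page size, I would have expected the 4KB constant
here -- just a sketch, assuming the PAGE_SIZE_4K macro the changelog
mentions is the right one:

                /* tie the threshold to the fixed 4KB layout of the ring,
                 * not to the architecture's PAGE_SIZE
                 */
                *can_write = avl_write_bytes > HVSOCK_PKT_LEN(PAGE_SIZE_4K);

On x86 the two values are the same, but on an architecture with a larger
PAGE_SIZE the threshold could exceed what a 5-page ring can ever offer, and
*can_write would then never become true.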

> +static int hvsock_open_connection(struct vmbus_channel *channel)
> +{
> +        struct hvsock_sock *hvsk, *new_hvsk;
> +        uuid_le *instance, *service_id;
> +        unsigned char conn_from_host;
> +        struct sockaddr_hv hv_addr;
> +        struct sock *sk, *new_sk;
> +        int ret;
> +
> +        instance = &channel->offermsg.offer.if_instance;
> +        service_id = &channel->offermsg.offer.if_type;
> +
> +        /* The first byte != 0 means the host initiated the connection. */
> +        conn_from_host = channel->offermsg.offer.u.pipe.user_def[0];
> +
> +        mutex_lock(&hvsock_mutex);
> +
> +        hvsock_addr_init(&hv_addr, conn_from_host ? *service_id : *instance);
> +        sk = hvsock_find_bound_socket(&hv_addr);
> +
> +        if (!sk || (conn_from_host && sk->sk_state != SS_LISTEN) ||
> +            (!conn_from_host && sk->sk_state != SS_CONNECTING)) {
> +                ret = -ENXIO;
> +                goto out;
> +        }
> +
> +        if (conn_from_host) {
> +                if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog) {
> +                        ret = -ECONNREFUSED;
> +                        goto out;
> +                }
> +
> +                new_sk = hvsock_create(sock_net(sk), NULL, GFP_KERNEL,
> +                                       sk->sk_type);
> +                if (!new_sk) {
> +                        ret = -ENOMEM;
> +                        goto out;
> +                }
> +
> +                new_sk->sk_state = SS_CONNECTING;
> +                new_hvsk = sk_to_hvsock(new_sk);
> +                new_hvsk->channel = channel;
> +                hvsock_addr_init(&new_hvsk->local_addr, *service_id);
> +                hvsock_addr_init(&new_hvsk->remote_addr, *instance);
> +        } else {
> +                hvsk = sk_to_hvsock(sk);
> +                hvsk->channel = channel;
> +        }
> +
> +        set_channel_read_state(channel, false);
> +        ret = vmbus_open(channel, RINGBUFFER_HVSOCK_SND_SIZE,
> +                         RINGBUFFER_HVSOCK_RCV_SIZE, NULL, 0,
> +                         hvsock_on_channel_cb, conn_from_host ? new_sk : sk);
> +        if (ret != 0) {
> +                if (conn_from_host) {
> +                        new_hvsk->channel = NULL;
> +                        sock_put(new_sk);
> +                } else {
> +                        hvsk->channel = NULL;
> +                }
> +                goto out;
> +        }
> +
> +        vmbus_set_chn_rescind_callback(channel, hvsock_close_connection);
> +
> +        /* see get_ringbuffer_rw_status() */
> +        set_channel_pending_send_size(channel, HVSOCK_PKT_LEN(PAGE_SIZE) + 1);

Same question.
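
I.e. presumably this call should also use the fixed constant -- a sketch
under the same assumption as above:

        set_channel_pending_send_size(channel,
                                      HVSOCK_PKT_LEN(PAGE_SIZE_4K) + 1);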

> +
> +        if (conn_from_host) {
> +                new_sk->sk_state = SS_CONNECTED;
> +
> +                sock_hold(&new_hvsk->sk);
> +                list_add(&new_hvsk->connected_list, &hvsock_connected_list);
> +
> +                hvsock_enqueue_accept(sk, new_sk);
> +        } else {
> +                sk->sk_state = SS_CONNECTED;
> +                sk->sk_socket->state = SS_CONNECTED;
> +
> +                sock_hold(&hvsk->sk);
> +                list_add(&hvsk->connected_list, &hvsock_connected_list);
> +        }
> +
> +        sk->sk_state_change(sk);
> +out:
> +        mutex_unlock(&hvsock_mutex);
> +        return ret;
> +}

...

> +static int hvsock_create_sock(struct net *net, struct socket *sock,
> +                              int protocol, int kern)
> +{
> +        struct sock *sk;
> +
> +        if (!capable(CAP_SYS_ADMIN) && !capable(CAP_NET_ADMIN))
> +                return -EPERM;

Looks like any application wanting to use Hyper-V sockets will need
rather high privileges. That would make sense if these sockets were
reserved for privileged tasks like VM management, but according to the
commit message, hv_sock is supposed to be used for regular
application-to-application communication. Requiring CAP_{SYS,NET}_ADMIN
looks like overkill to me.
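
For comparison, a sketch of what the function could look like without the
blanket privilege requirement -- purely illustrative; whether some policy
check should instead live in bind()/connect() is a separate question:

static int hvsock_create_sock(struct net *net, struct socket *sock,
                              int protocol, int kern)
{
        struct sock *sk;

        /* no CAP_SYS_ADMIN/CAP_NET_ADMIN gate: ordinary applications can
         * create a socket; any GUID-based policy could be enforced later,
         * at bind()/connect() time
         */
        if (protocol != 0 && protocol != SHV_PROTO_RAW)
                return -EPROTONOSUPPORT;

        if (sock->type != SOCK_STREAM)
                return -ESOCKTNOSUPPORT;

        sock->ops = &hvsock_ops;
        sock->state = SS_UNCONNECTED;

        sk = hvsock_create(net, sock, GFP_KERNEL, 0);
        return sk ? 0 : -ENOMEM;
}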

> +
> +        if (protocol != 0 && protocol != SHV_PROTO_RAW)
> +                return -EPROTONOSUPPORT;
> +
> +        switch (sock->type) {
> +        case SOCK_STREAM:
> +                sock->ops = &hvsock_ops;
> +                break;
> +        default:
> +                return -ESOCKTNOSUPPORT;
> +        }
> +
> +        sock->state = SS_UNCONNECTED;
> +
> +        sk = hvsock_create(net, sock, GFP_KERNEL, 0);
> +        return sk ? 0 : -ENOMEM;
> +}

Michal Kubecek