Re: [PATCH net-next] vhost_net: batch used ring update in rx

From: Michael S. Tsirkin
Date: Tue Jan 09 2018 - 10:45:02 EST


On Tue, Jan 09, 2018 at 06:27:45PM +0800, Jason Wang wrote:
> This patch tries to batched used ring update during RX. This is pretty
> fit for the case when guest is much faster (e.g dpdk based
> backend). In this case, used ring is almost empty:
>
> - we may get serious cache line misses/contending on both used ring
> and used idx.
> - at most 1 packet could be dequeued at one time, batching in guest
> does not make much effect.
>
> Update used ring in a batch can help since guest won't access the used
> ring until used idx was advanced for several descriptors and since we
> advance used ring for every N packets, guest will only need to access
> used idx for every N packet since it can cache the used idx. To have a
> better interaction for both batch dequeuing and dpdk batching,
> VHOST_RX_BATCH was used as the maximum number of descriptors that
> could be batched.
>
> Test were done between two machines with 2.40GHz Intel(R) Xeon(R) CPU
> E5-2630 connected back to back through ixgbe. Traffic were generated
> on one remote ixgbe through MoonGen and measure the RX pps through
> testpmd in guest when do xdp_redirect_map from local ixgbe to
> tap. RX pps were increased from 3.05 Mpps to 4.00 Mpps (about 31%
> improvement).
>
> One possible concern for this is the implications for TCP (especially
> latency sensitive workload). Result[1] does not show obvious changes
> for most of the netperf test (RR, TX, and RX). And we do get some
> improvements for RX on some specific size.
>
> Guest RX:
>
> size/sessions/+thu%/+normalize%
> 64/ 1/ +2%/ +2%
> 64/ 2/ +2%/ -1%
> 64/ 4/ +1%/ +1%
> 64/ 8/ 0%/ 0%
> 256/ 1/ +6%/ -3%
> 256/ 2/ -3%/ +2%
> 256/ 4/ +11%/ +11%
> 256/ 8/ 0%/ 0%
> 512/ 1/ +4%/ 0%
> 512/ 2/ +2%/ +2%
> 512/ 4/ 0%/ -1%
> 512/ 8/ -8%/ -8%
> 1024/ 1/ -7%/ -17%
> 1024/ 2/ -8%/ -7%
> 1024/ 4/ +1%/ 0%
> 1024/ 8/ 0%/ 0%
> 2048/ 1/ +30%/ +14%
> 2048/ 2/ +46%/ +40%
> 2048/ 4/ 0%/ 0%
> 2048/ 8/ 0%/ 0%
> 4096/ 1/ +23%/ +22%
> 4096/ 2/ +26%/ +23%
> 4096/ 4/ 0%/ +1%
> 4096/ 8/ 0%/ 0%
> 16384/ 1/ -2%/ -3%
> 16384/ 2/ +1%/ -4%
> 16384/ 4/ -1%/ -3%
> 16384/ 8/ 0%/ -1%
> 65535/ 1/ +15%/ +7%
> 65535/ 2/ +4%/ +7%
> 65535/ 4/ 0%/ +1%
> 65535/ 8/ 0%/ 0%
>
> TCP_RR:
>
> size/sessions/+thu%/+normalize%
> 1/ 1/ 0%/ +1%
> 1/ 25/ +2%/ +1%
> 1/ 50/ +4%/ +1%
> 64/ 1/ 0%/ -4%
> 64/ 25/ +2%/ +1%
> 64/ 50/ 0%/ -1%
> 256/ 1/ 0%/ 0%
> 256/ 25/ 0%/ 0%
> 256/ 50/ +4%/ +2%
>
> Guest TX:
>
> size/sessions/+thu%/+normalize%
> 64/ 1/ +4%/ -2%
> 64/ 2/ -6%/ -5%
> 64/ 4/ +3%/ +6%
> 64/ 8/ 0%/ +3%
> 256/ 1/ +15%/ +16%
> 256/ 2/ +11%/ +12%
> 256/ 4/ +1%/ 0%
> 256/ 8/ +5%/ +5%
> 512/ 1/ -1%/ -6%
> 512/ 2/ 0%/ -8%
> 512/ 4/ -2%/ +4%
> 512/ 8/ +6%/ +9%
> 1024/ 1/ +3%/ +1%
> 1024/ 2/ +3%/ +9%
> 1024/ 4/ 0%/ +7%
> 1024/ 8/ 0%/ +7%
> 2048/ 1/ +8%/ +2%
> 2048/ 2/ +3%/ -1%
> 2048/ 4/ -1%/ +11%
> 2048/ 8/ +3%/ +9%
> 4096/ 1/ +8%/ +8%
> 4096/ 2/ 0%/ -7%
> 4096/ 4/ +4%/ +4%
> 4096/ 8/ +2%/ +5%
> 16384/ 1/ -3%/ +1%
> 16384/ 2/ -1%/ -12%
> 16384/ 4/ -1%/ +5%
> 16384/ 8/ 0%/ +1%
> 65535/ 1/ 0%/ -3%
> 65535/ 2/ +5%/ +16%
> 65535/ 4/ +1%/ +2%
> 65535/ 8/ +1%/ -1%
>
> Signed-off-by: Jason Wang <jasowang@xxxxxxxxxx>

I keep wondering whether we want control over this
from the guest (e.g. ethtool).

And I guess UDP_RR would be a better test.

But overall I agree it's a good default.

Acked-by: Michael S. Tsirkin <mst@xxxxxxxxxx>


> ---
> drivers/vhost/net.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index c7bdeb6..988af72 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -744,7 +744,7 @@ static void handle_rx(struct vhost_net *net)
> };
> size_t total_len = 0;
> int err, mergeable;
> - s16 headcount;
> + s16 headcount, nheads = 0;
> size_t vhost_hlen, sock_hlen;
> size_t vhost_len, sock_len;
> struct socket *sock;
> @@ -772,7 +772,7 @@ static void handle_rx(struct vhost_net *net)
> while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
> sock_len += sock_hlen;
> vhost_len = sock_len + vhost_hlen;
> - headcount = get_rx_bufs(vq, vq->heads, vhost_len,
> + headcount = get_rx_bufs(vq, vq->heads + nheads, vhost_len,
> &in, vq_log, &log,
> likely(mergeable) ? UIO_MAXIOV : 1);
> /* On error, stop handling until the next kick. */
> @@ -844,8 +844,12 @@ static void handle_rx(struct vhost_net *net)
> vhost_discard_vq_desc(vq, headcount);
> goto out;
> }
> - vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
> - headcount);
> + nheads += headcount;
> + if (nheads > VHOST_RX_BATCH) {
> + vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
> + nheads);
> + nheads = 0;
> + }
> if (unlikely(vq_log))
> vhost_log_write(vq, vq_log, log, vhost_len);
> total_len += vhost_len;
> @@ -856,6 +860,9 @@ static void handle_rx(struct vhost_net *net)
> }
> vhost_net_enable_vq(net, vq);
> out:
> + if (nheads)
> + vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
> + nheads);
> mutex_unlock(&vq->mutex);
> }
>
> --
> 1.8.3.1