Re: [RFC 1/5] vhost: reorder used descriptors in a batch

From: Guo Zhi
Date: Thu Aug 11 2022 - 04:58:49 EST




----- Original Message -----
> From: "jasowang" <jasowang@xxxxxxxxxx>
> To: "Guo Zhi" <qtxuning1999@xxxxxxxxxxx>
> Cc: "eperezma" <eperezma@xxxxxxxxxx>, "sgarzare" <sgarzare@xxxxxxxxxx>, "Michael Tsirkin" <mst@xxxxxxxxxx>, "netdev"
> <netdev@xxxxxxxxxxxxxxx>, "linux-kernel" <linux-kernel@xxxxxxxxxxxxxxx>, "kvm list" <kvm@xxxxxxxxxxxxxxx>,
> "virtualization" <virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx>
> Sent: Thursday, August 4, 2022 1:04:16 PM
> Subject: Re: [RFC 1/5] vhost: reorder used descriptors in a batch

> On Tue, Aug 2, 2022 at 10:12 PM Guo Zhi <qtxuning1999@xxxxxxxxxxx> wrote:
>>
>>
>>
>> ----- Original Message -----
>> > From: "jasowang" <jasowang@xxxxxxxxxx>
>> > To: "Guo Zhi" <qtxuning1999@xxxxxxxxxxx>
>> > Cc: "eperezma" <eperezma@xxxxxxxxxx>, "sgarzare" <sgarzare@xxxxxxxxxx>, "Michael
>> > Tsirkin" <mst@xxxxxxxxxx>, "netdev"
>> > <netdev@xxxxxxxxxxxxxxx>, "linux-kernel" <linux-kernel@xxxxxxxxxxxxxxx>, "kvm
>> > list" <kvm@xxxxxxxxxxxxxxx>,
>> > "virtualization" <virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx>
>> > Sent: Friday, July 29, 2022 3:32:02 PM
>> > Subject: Re: [RFC 1/5] vhost: reorder used descriptors in a batch
>>
>> > On Thu, Jul 28, 2022 at 4:26 PM Guo Zhi <qtxuning1999@xxxxxxxxxxx> wrote:
>> >>
>> >> On 2022/7/26 15:36, Jason Wang wrote:
>> >>
>> >>
>> >> 在 2022/7/21 16:43, Guo Zhi 写道:
>> >>
>> >> Device may not use descriptors in order, for example, NIC and SCSI may
>> >> not call __vhost_add_used_n with buffers in order. It's the task of
>> >> __vhost_add_used_n to order them.
>> >>
>> >>
>> >>
>> >> I'm not sure this is ture. Having ooo descriptors is probably by design to have
>> >> better performance.
>> >>
>> >> This might be obvious for device that may have elevator or QOS stuffs.
>> >>
>> >> I suspect the right thing to do here is, for the device that can't perform
>> >> better in the case of IN_ORDER, let's simply not offer IN_ORDER (zerocopy or
>> >> scsi). And for the device we know it can perform better, non-zercopy ethernet
>> >> device we can do that.
>> >>
>> >>
>> >> This commit reorder the buffers using
>> >> vq->heads, only the batch is begin from the expected start point and is
>> >> continuous can the batch be exposed to driver. And only writing out a
>> >> single used ring for a batch of descriptors, according to VIRTIO 1.1
>> >> spec.
>> >>
>> >>
>> >>
>> >> So this sounds more like a "workaround" of the device that can't consume buffer
>> >> in order, I suspect it can help in performance.
>> >>
>> >> More below.
>> >>
>> >>
>> >>
>> >> Signed-off-by: Guo Zhi <qtxuning1999@xxxxxxxxxxx>
>> >> ---
>> >> drivers/vhost/vhost.c | 44 +++++++++++++++++++++++++++++++++++++++++--
>> >> drivers/vhost/vhost.h | 3 +++
>> >> 2 files changed, 45 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> >> index 40097826c..e2e77e29f 100644
>> >> --- a/drivers/vhost/vhost.c
>> >> +++ b/drivers/vhost/vhost.c
>> >> @@ -317,6 +317,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>> >> vq->used_flags = 0;
>> >> vq->log_used = false;
>> >> vq->log_addr = -1ull;
>> >> + vq->next_used_head_idx = 0;
>> >> vq->private_data = NULL;
>> >> vq->acked_features = 0;
>> >> vq->acked_backend_features = 0;
>> >> @@ -398,6 +399,8 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
>> >> GFP_KERNEL);
>> >> if (!vq->indirect || !vq->log || !vq->heads)
>> >> goto err_nomem;
>> >> +
>> >> + memset(vq->heads, 0, sizeof(*vq->heads) * dev->iov_limit);
>> >> }
>> >> return 0;
>> >> @@ -2374,12 +2377,49 @@ static int __vhost_add_used_n(struct vhost_virtqueue
>> >> *vq,
>> >> unsigned count)
>> >> {
>> >> vring_used_elem_t __user *used;
>> >> + struct vring_desc desc;
>> >> u16 old, new;
>> >> int start;
>> >> + int begin, end, i;
>> >> + int copy_n = count;
>> >> +
>> >> + if (vhost_has_feature(vq, VIRTIO_F_IN_ORDER)) {
>> >>
>> >>
>> >>
>> >> How do you guarantee that ids of heads are contiguous?
>> >>
>> >> There is no need to be contiguous for ids of heads.
>> >>
>> >> For example, I have three buffer { .id = 0, 15}, {.id = 20, 30} {.id = 15, 20}
>> >> for vhost_add_used_n. Then I will let the vq->heads[0].len=15.
>> >> vq->heads[15].len=5, vq->heads[20].len=10 as reorder. Once I found there is no
>> >> hold in the batched descriptors. I will expose them to driver.
>> >
>> > So spec said:
>> >
>> > "If VIRTIO_F_IN_ORDER has been negotiated, driver uses descriptors in
>> > ring order: starting from offset 0 in the table, and wrapping around
>> > at the end of the table."
>> >
>> > And
>> >
>> > "VIRTIO_F_IN_ORDER(35)This feature indicates that all buffers are used
>> > by the device in the same order in which they have been made
>> > available."
>> >
>> > This means your example is not an IN_ORDER device.
>> >
>> > The driver should submit buffers (assuming each buffer have one
>> > descriptor) in order {id = 0, 15}, {id = 1, 30} and {id = 2, 20}.
>> >
>> > And even if it is submitted in order, we can not use a batch because:
>> >
>> > "The skipped buffers (for which no used ring entry was written) are
>> > assumed to have been used (read or written) by the device completely."
>> >
>> > This means for TX we are probably ok, but for rx, unless we know the
>> > buffers were written completely, we can't write them in a batch.
>> >
>> > I'd suggest to do cross testing for this series:
>> >
>> > 1) testing vhost IN_ORDER support with DPDK virtio PMD
>> > 2) testing virtio IN_ORDER with DPDK vhost-user via testpmd
>> >
>> > Thanks
>> >
>> You are correct, for rx we can't do a batch because we have to let the driver
>> know the length of buffers.
>
> Note that we can do a batch for rx when we know all the buffers have
> been fully written.
>
>>
>> I think these circumstances can offer batch:
>> 1. tx
>> 2. rx with RX_MRGBUF feature, which introduce a header for each received buffer
>>
>> Consider batch is not a mandatory requirement for in order feature according to
>> spec.
>> I'd like to let current RFC patch focus on in order implementation, and send
>> another
>> patch series to improve performance by batching on above circumstances.
>
> That's fine, how about simply starting from the patch that offers
> IN_ORDER when zerocopy is disabled?
>

Yeah, I'd like to start from vsock device, which doesn't use zerocopy

Thanks
> Thanks
>
>>
>> What's your opinon.
>>
>> Thanks
>> >
>> >>
>> >>
>> >> + /* calculate descriptor chain length for each used buffer */
>> >>
>> >>
>> >>
>> >> I'm a little bit confused about this comment, we have heads[i].len for this?
>> >>
>> >> Maybe I should not use vq->heads, some misleading.
>> >>
>> >>
>> >> + for (i = 0; i < count; i++) {
>> >> + begin = heads[i].id;
>> >> + end = begin;
>> >> + vq->heads[begin].len = 0;
>> >>
>> >>
>> >>
>> >> Does this work for e.g RX virtqueue?
>> >>
>> >>
>> >> + do {
>> >> + vq->heads[begin].len += 1;
>> >> + if (unlikely(vhost_get_desc(vq, &desc, end))) {
>> >>
>> >>
>> >>
>> >> Let's try hard to avoid more userspace copy here, it's the source of performance
>> >> regression.
>> >>
>> >> Thanks
>> >>
>> >>
>> >> + vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
>> >> + end, vq->desc + end);
>> >> + return -EFAULT;
>> >> + }
>> >> + } while ((end = next_desc(vq, &desc)) != -1);
>> >> + }
>> >> +
>> >> + count = 0;
>> >> + /* sort and batch continuous used ring entry */
>> >> + while (vq->heads[vq->next_used_head_idx].len != 0) {
>> >> + count++;
>> >> + i = vq->next_used_head_idx;
>> >> + vq->next_used_head_idx = (vq->next_used_head_idx +
>> >> + vq->heads[vq->next_used_head_idx].len)
>> >> + % vq->num;
>> >> + vq->heads[i].len = 0;
>> >> + }
>> >> + /* only write out a single used ring entry with the id corresponding
>> >> + * to the head entry of the descriptor chain describing the last buffer
>> >> + * in the batch.
>> >> + */
>> >> + heads[0].id = i;
>> >> + copy_n = 1;
>> >> + }
>> >> start = vq->last_used_idx & (vq->num - 1);
>> >> used = vq->used->ring + start;
>> >> - if (vhost_put_used(vq, heads, start, count)) {
>> >> + if (vhost_put_used(vq, heads, start, copy_n)) {
>> >> vq_err(vq, "Failed to write used");
>> >> return -EFAULT;
>> >> }
>> >> @@ -2410,7 +2450,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct
>> >> vring_used_elem *heads,
>> >> start = vq->last_used_idx & (vq->num - 1);
>> >> n = vq->num - start;
>> >> - if (n < count) {
>> >> + if (n < count && !vhost_has_feature(vq, VIRTIO_F_IN_ORDER)) {
>> >> r = __vhost_add_used_n(vq, heads, n);
>> >> if (r < 0)
>> >> return r;
>> >> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> >> index d9109107a..7b2c0fbb5 100644
>> >> --- a/drivers/vhost/vhost.h
>> >> +++ b/drivers/vhost/vhost.h
>> >> @@ -107,6 +107,9 @@ struct vhost_virtqueue {
>> >> bool log_used;
>> >> u64 log_addr;
>> >> + /* Sort heads in order */
>> >> + u16 next_used_head_idx;
>> >> +
>> >> struct iovec iov[UIO_MAXIOV];
>> >> struct iovec iotlb_iov[64];
>> >> struct iovec *indirect;
>> >>
>> >>
>> >>
>>