Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue

From: Eli Cohen
Date: Thu Feb 04 2021 - 01:54:45 EST


On Wed, Feb 03, 2021 at 03:19:40PM -0800, Si-Wei Liu wrote:
> On Tue, Feb 2, 2021 at 9:16 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
> >
> >
> > > On 2021/2/3 1:54 AM, Si-Wei Liu wrote:
> > > On Tue, Feb 2, 2021 at 1:23 AM Eli Cohen <elic@xxxxxxxxxx> wrote:
> > >> On Tue, Feb 02, 2021 at 12:38:51AM -0800, Si-Wei Liu wrote:
> > >>> Thanks Eli and Jason for clarifications. See inline.
> > >>>
> > >>> On Mon, Feb 1, 2021 at 11:06 PM Eli Cohen <elic@xxxxxxxxxx> wrote:
> > >>>> On Tue, Feb 02, 2021 at 02:02:25PM +0800, Jason Wang wrote:
> > >>>>>> On 2021/2/2 12:15 PM, Si-Wei Liu wrote:
> > >>>>>> On Mon, Feb 1, 2021 at 7:13 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
> > >>>>>>>> On 2021/2/2 3:17 AM, Si-Wei Liu wrote:
> > >>>>>>>> On Mon, Feb 1, 2021 at 10:51 AM Si-Wei Liu <siwliu.kernel@xxxxxxxxx> wrote:
> > >>>>>>>>> On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen <elic@xxxxxxxxxx> wrote:
> > >>>>>>>>>> suspend_vq should only suspend the VQ and not save the current
> > >>>>>>>>>> available index. Saving is done when a change of map occurs, when
> > >>>>>>>>>> the driver calls save_channel_info().
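> > >>>>>>>>>>
> > >>>>>>>>>> For context, the save path looks roughly like this (an abridged
> > >>>>>>>>>> sketch of save_channel_info(); remaining fields and error paths
> > >>>>>>>>>> trimmed, so treat it as illustrative only):
> > >>>>>>>>>>
> > >>>>>>>>>> static int save_channel_info(struct mlx5_vdpa_net *ndev,
> > >>>>>>>>>> 			     struct mlx5_vdpa_virtqueue *mvq)
> > >>>>>>>>>> {
> > >>>>>>>>>> 	struct mlx5_vq_restore_info *ri = &mvq->ri;
> > >>>>>>>>>> 	struct mlx5_virtq_attr attr;
> > >>>>>>>>>> 	int err;
> > >>>>>>>>>>
> > >>>>>>>>>> 	if (!mvq->initialized)
> > >>>>>>>>>> 		return 0;
> > >>>>>>>>>>
> > >>>>>>>>>> 	/* the available index is read from the device here, on the
> > >>>>>>>>>> 	 * change-of-map path, rather than in suspend_vq() */
> > >>>>>>>>>> 	err = query_virtqueue(ndev, mvq, &attr);
> > >>>>>>>>>> 	if (err)
> > >>>>>>>>>> 		return err;
> > >>>>>>>>>>
> > >>>>>>>>>> 	ri->avail_index = attr.available_index;
> > >>>>>>>>>> 	ri->ready = mvq->ready;
> > >>>>>>>>>> 	/* ring addresses and sizes are saved here as well */
> > >>>>>>>>>> 	return 0;
> > >>>>>>>>>> }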
> > >>>>>>>>> Hmmm, suspend_vq() is also called by teardown_vq(), the latter of
> > >>>>>>>>> which doesn't save the available index as save_channel_info() doesn't
> > >>>>>>>>> get called in that path at all. How does it handle the case that
> > >>>>>>>>> get_vq_state() is called from userspace (e.g. QEMU) while the
> > >>>>>>>>> hardware VQ object was torn down, but userspace still wants to access
> > >>>>>>>>> the queue index?
> > >>>>>>>>>
> > >>>>>>>>> Refer to https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei.liu@xxxxxxxxxx/
> > >>>>>>>>>
> > >>>>>>>>> vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)
> > >>>>>>>>> vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11)
> > >>>>>>>>>
> > >>>>>>>>> QEMU will complain with the above warnings while the VM is being
> > >>>>>>>>> rebooted or shut down.
> > >>>>>>>>>
> > >>>>>>>>> Looks to me either the kernel driver should cover this requirement,
> > >>>>>>>>> or the userspace has to bear the burden of saving the index and not
> > >>>>>>>>> calling into the kernel if the VQ is destroyed.
> > >>>>>>>> Actually, the userspace has no insight into whether the virtqueue
> > >>>>>>>> will be destroyed when just changing the device status via
> > >>>>>>>> set_status(). Looking at the other in-tree vdpa driver, i.e. ifcvf,
> > >>>>>>>> it doesn't behave this way. Hence this still looks to me like a
> > >>>>>>>> Mellanox-specific mlx5_vdpa implementation detail that shouldn't be
> > >>>>>>>> exposed to userspace.
> > >>>>>>> So I think we can simply drop this patch?
> > >>>>>> Yep, I think so. To be honest I don't know why it has anything to do
> > >>>>>> with the memory hotplug issue.
> > >>>>>
> > >>>>> Eli may know more. My understanding is that, during memory hotplug,
> > >>>>> QEMU needs to propagate new memory mappings via set_map(). For
> > >>>>> Mellanox, that means it needs to rebuild memory keys, so the
> > >>>>> virtqueues need to be suspended.
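> > >>>>>
> > >>>>> A rough sketch of that flow (paraphrasing mlx5_vdpa_change_map() from
> > >>>>> this driver; error handling omitted, so treat it as illustrative):
> > >>>>>
> > >>>>> static int mlx5_vdpa_change_map(struct mlx5_vdpa_net *ndev,
> > >>>>> 				struct vhost_iotlb *iotlb)
> > >>>>> {
> > >>>>> 	suspend_vqs(ndev);
> > >>>>> 	save_channels_info(ndev);	/* stash per-VQ state */
> > >>>>> 	teardown_driver(ndev);		/* destroys the VQ objects */
> > >>>>> 	mlx5_vdpa_destroy_mr(&ndev->mvdev);	  /* drop the old memory key */
> > >>>>> 	mlx5_vdpa_create_mr(&ndev->mvdev, iotlb); /* key for the new map */
> > >>>>> 	restore_channels_info(ndev);
> > >>>>> 	return setup_driver(ndev);	/* re-creates the VQ objects */
> > >>>>> }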
> > >>>>>
> > >>>> I think Siwei was asking why the first patch was related to the hotplug
> > >>>> issue.
> > >>> I was thinking about how consistency is assured when saving/restoring
> > >>> this h/w avail_index against the one in the virtq memory, particularly
> > >>> in the .region_add()/.region_del() context (e.g. the hotplug case). The
> > >>> problem is I don't see an explicit memory barrier when the guest thread
> > >>> updates the avail_index; how does the device make sure the h/w
> > >>> avail_index is up to date while the guest may be racing to update the
> > >>> virtq's avail_index in the meantime? Maybe I completely missed
> > >>> something obvious?
> > >>
> > >> If you're asking about synchronization upon hot plug of memory, the
> > >> hardware always goes to read the available index from memory when a new
> > >> hardware object is associated with a virtqueue. You can argue then that
> > >> you don't need to restore the available index, and you may be right, but
> > >> this is the current interface to the firmware.
> > >>
> > >>
> > >> If you're asking generally about how sync is assured when the guest
> > >> updates the available index, can you please send a pointer to the code
> > >> where you see the update without a memory barrier?
> > > This is a snippet of virtqueue_add_split() where the avail index gets
> > > updated by the guest:
> > >
> > > 	/* Put entry in available array (but don't update avail->idx until they
> > > 	 * do sync). */
> > > 	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > 	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > >
> > > 	/* Descriptors and available array need to be set before we expose the
> > > 	 * new available array entries. */
> > > 	virtio_wmb(vq->weak_barriers);
> > > 	vq->split.avail_idx_shadow++;
> > > 	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > 						vq->split.avail_idx_shadow);
> > > 	vq->num_added++;
> > >
> > > There's a memory barrier to make sure the update to the descriptor and
> > > available ring is seen before the write to avail->idx, but there seems
> > > to be no guarantee that this update is flushed to memory immediately,
> > > either before or after mlx5-vdpa is suspended and gets the old
> > > avail_index value stashed somewhere. In this case, how does the
> > > hardware ensure consistency between the guest virtio and host
> > > mlx5-vdpa? Or does it completely rely on the guest to update the
> > > avail_index once the next buffer is available, so that the index gets
> > > back in sync?
> >
> >
> > I'm not sure I get the question, but notice that the driver should check
> > and notify the virtqueue when the device wants a notification. So
> > there's a virtio_mb(), e.g. in:
> >
> > static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
> > {
> > 	struct vring_virtqueue *vq = to_vvq(_vq);
> > 	u16 new, old;
> > 	bool needs_kick;
> >
> > 	START_USE(vq);
> > 	/* We need to expose available array entries before checking avail
> > 	 * event. */
> > 	virtio_mb(vq->weak_barriers);
> >
> > 	old = vq->split.avail_idx_shadow - vq->num_added;
> > 	new = vq->split.avail_idx_shadow;
> > 	vq->num_added = 0;
> >
> > (See the comment above virtio_mb().) So the avail idx is guaranteed to
> > be committed to memory (cache hierarchy) before the check for a
> > notification. I think we sync through this.
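> >
> > For reference, virtio_mb() from include/linux/virtio_ring.h (quoted from
> > memory, so double-check against the tree you're on) picks the barrier
> > strength depending on whether the device is real hardware or a
> > software/para-virtual one:
> >
> > static inline void virtio_mb(bool weak_barriers)
> > {
> > 	if (weak_barriers)
> > 		virt_mb();	/* SMP barrier suffices for a soft device */
> > 	else
> > 		mb();		/* full barrier for real hardware */
> > }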
>
> Thanks for pointing it out! Indeed, the avail index is synced before
> kicking the device. However, even so I think the race might still be
> possible, see below:
>
>
> mlx5_vdpa device                          virtio driver
> -------------------------------------------------------------------------
>                                           virtqueue_add_split
>                                             (bumped up avail_idx1 to
>                                              avail_idx2, but update not
>                                              yet committed to mem)
>
> (hot plug memory...)
> mlx5_vdpa_set_map
>   mlx5_vdpa_change_map
>     suspend_vqs
>       suspend_vq
>         (avail_idx1 was saved)
>     save_channels_info
>     :
>     :                                     virtqueue_kick_prepare_split
>     :                                       (avail_idx2 committed to memory)
>     restore_channels_info
>     setup_driver
>       ...
>         create_virtqueue
>           (saved avail_idx1 getting
>            restored; whereas current
>            avail_idx2 in memory is
>            ignored)
>     :
>     :                                     virtqueue_notify
>                                             vp_notify
> mlx5_vdpa_kick_vq
>   (device processing up to
>    avail_idx1 rather than
>    avail_idx2?)
>
>
> According to Eli, the vdpa firmware does not load the latest value,
> i.e. avail_idx2, from memory when re-creating the virtqueue object.

I said exactly the opposite. The hardware always goes to read the
available index from memory. The requirement to configure it when
creating a new object is still a requirement defined by the interface,
so I must not violate interface requirements.
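
A minimal sketch of where that interface requirement shows up (abridged
from create_virtqueue(); field names quoted from memory, so verify them
against mlx5_ifc before relying on this):

	obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in, obj_context);
	/* The initial available index must be supplied at object creation
	 * time, even though the hardware re-reads it from ring memory once
	 * the object is associated with the virtqueue. */
	MLX5_SET(virtio_net_q_object, obj_context, hw_available_index,
		 mvq->avail_idx);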

> Instead it starts with the stale avail_idx1 that was saved in the
> mlx5_vdpa driver's private structure before the memory update was made.
> That is the way the current firmware interface is designed. The thing
> I'm not sure about, though, is whether the firmware/device will resync
> and get the latest avail_idx2 from memory while processing the kick
> request in mlx5_vdpa_kick_vq. If not, a few avail entries will be
> missing from the kick, which could cause odd behavior or implicit
> failure.
>
> But if the firmware does resync on kick_vq (and get_vq_state), then I
> fail to see the point of saving and restoring avail_idx when changing
> the memory map...
>
> Thanks,
> -Siwei
>
> >
> > Thanks
> >
> >
> > >
> > > Thanks,
> > > -Siwei
> > >
> > >>> Thanks,
> > >>> -Siwei
> > >>>
> > >>>> But you're correct. When memory is added, I get a new memory map. This
> > >>>> requires me to build a new memory key object which covers the new memory
> > >>>> map. Since the virtqueue objects reference this memory key, I need
> > >>>> to destroy them and build new ones referencing the new memory key.
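> > >>>>
> > >>>> That dependency is a single line at object creation time; roughly (one
> > >>>> line abridged from create_virtqueue(); field names from memory, so
> > >>>> verify against mlx5_ifc):
> > >>>>
> > >>>> 	/* the virtqueue object is bound to the memory key covering the
> > >>>> 	 * current map; a new map means a new key, hence the
> > >>>> 	 * destroy/re-create cycle */
> > >>>> 	MLX5_SET(virtio_q, vq_ctx, virtio_q_mkey, ndev->mvdev.mr.mkey.key);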
> > >>>>
> > >>>>> Thanks
> > >>>>>
> > >>>>>
> > >>>>>> -Siwei
> > >>>>>>
> > >>>>>>> Thanks
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>>> -Siwei
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>> Signed-off-by: Eli Cohen <elic@xxxxxxxxxx>
> > >>>>>>>>>> ---
> > >>>>>>>>>> drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 --------
> > >>>>>>>>>> 1 file changed, 8 deletions(-)
> > >>>>>>>>>>
> > >>>>>>>>>> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > >>>>>>>>>> index 88dde3455bfd..549ded074ff3 100644
> > >>>>>>>>>> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > >>>>>>>>>> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > >>>>>>>>>> @@ -1148,8 +1148,6 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq)
> > >>>>>>>>>>
> > >>>>>>>>>> static void suspend_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq)
> > >>>>>>>>>> {
> > >>>>>>>>>> - struct mlx5_virtq_attr attr;
> > >>>>>>>>>> -
> > >>>>>>>>>> if (!mvq->initialized)
> > >>>>>>>>>> return;
> > >>>>>>>>>>
> > >>>>>>>>>> @@ -1158,12 +1156,6 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m
> > >>>>>>>>>>
> > >>>>>>>>>> if (modify_virtqueue(ndev, mvq, MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
> > >>>>>>>>>> mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n");
> > >>>>>>>>>> -
> > >>>>>>>>>> - if (query_virtqueue(ndev, mvq, &attr)) {
> > >>>>>>>>>> - mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n");
> > >>>>>>>>>> - return;
> > >>>>>>>>>> - }
> > >>>>>>>>>> - mvq->avail_idx = attr.available_index;
> > >>>>>>>>>> }
> > >>>>>>>>>>
> > >>>>>>>>>> static void suspend_vqs(struct mlx5_vdpa_net *ndev)
> > >>>>>>>>>> --
> > >>>>>>>>>> 2.29.2
> > >>>>>>>>>>
> >