Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release

From: Jason Wang
Date: Wed Oct 18 2023 - 01:28:13 EST


On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote:
>
>
>
> On 10/16/2023 7:35 PM, Jason Wang wrote:
> > On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote:
> >>
> >>
> >> On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
> >>> On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@xxxxxxxxxx> wrote:
> >>>> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> On 10/12/2023 8:01 PM, Jason Wang wrote:
> >>>>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote:
> >>>>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> >>>>>>> may need to restore iotlb mapping to the initial or default state
> >>>>>>> using the .reset_map op, as it's desirable for some parent devices
> >>>>>>> to solely manipulate mappings by its own, independent of virtio device
> >>>>>>> state. For instance, device reset does not cause mapping go away on
> >>>>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> >>>>>>> is going away, give them a chance to reset iotlb back to the initial
> >>>>>>> state in vhost_vdpa_cleanup().
> >>>>>>>
> >>>>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@xxxxxxxxxx>
> >>>>>>> ---
> >>>>>>> drivers/vhost/vdpa.c | 16 ++++++++++++++++
> >>>>>>> 1 file changed, 16 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>>>>>> index 851535f..a3f8160 100644
> >>>>>>> --- a/drivers/vhost/vdpa.c
> >>>>>>> +++ b/drivers/vhost/vdpa.c
> >>>>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
> >>>>>>> return vhost_vdpa_alloc_as(v, asid);
> >>>>>>> }
> >>>>>>>
> >>>>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> >>>>>>> +{
> >>>>>>> + struct vdpa_device *vdpa = v->vdpa;
> >>>>>>> + const struct vdpa_config_ops *ops = vdpa->config;
> >>>>>>> +
> >>>>>>> + if (ops->reset_map)
> >>>>>>> + ops->reset_map(vdpa, asid);
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>>>>>> {
> >>>>>>> struct vhost_vdpa_as *as = asid_to_as(v, asid);
> >>>>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>>>>>>
> >>>>>>> hlist_del(&as->hash_link);
> >>>>>>> vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> >>>>>>> + /*
> >>>>>>> + * Devices with vendor specific IOMMU may need to restore
> >>>>>>> + * iotlb to the initial or default state which is not done
> >>>>>>> + * through device reset, as the IOTLB mapping manipulation
> >>>>>>> + * could be decoupled from the virtio device life cycle.
> >>>>>>> + */
> >>>>>> Should we do this according to whether IOTLB_PRESIST is set?
> >>>>> Well, in theory this seems like so but it's unnecessary code change
> >>>>> actually, as that is the way how vDPA parent behind platform IOMMU works
> >>>>> today, and userspace doesn't break as of today. :)
> >>>> Well, this is one question I've ever asked before. You have explained
> >>>> that one of the reason that we don't break userspace is that they may
> >>>> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
> >>>>
> >>>>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
> >>>>> it doesn't necessarily mean the iotlb will definitely be destroyed
> >>>>> across reset (think about the platform IOMMU case), so userspace today
> >>>>> is already tolerating enough with either good or bad IOMMU.
> > I'm confused, how to define tolerating here?
>
> Tolerating defined as QEMU has to proactively unmap before reset just to
> workaround the driver bug (on-chip maps out of sync), unconditionally
> for platform or on-chip. While we all know it doesn't have to do so for
> platform IOMMU, though userspace has no means to distinguish. That said,
> userspace is sacrificing reset time performance on platform IOMMU setup
> just for working around buggy implementation in the other setup.

Ok, so what you actually mean is that userspace can tolerate the "bug"
with the performance penalty.


>
> > For example, if it has tolerance, why bother?
> I'm not sure I get the question. But I think userspace is compromising
> because of buggy implementation in a few drivers doesn't mean we should
> uniformly enforce such behavior for all set_map/dma_map implementations.

This is not my point. I meant, we can fix we need a negotiation in
order to let some "buggy" old user space to survive from the changes.

>
> >
> >>>> This code of
> >>>>> not checking IOTLB_PERSIST being set is intentional, there's no point to
> >>>>> emulate bad IOMMU behavior even for older userspace (with improper
> >>>>> emulation to be done it would result in even worse performance).
> > I can easily imagine a case:
> >
> > The old Qemu that works only with a setup like mlx5_vdpa.
> Noted, seems to me there's no such case of a userspace implementation
> that only works with mlx5_vdpa or its friends, but doesn't work with the
> others e.g. platform IOMMU, or well behaving on-chip IOMMU
> implementations.

It's not hard to think of a case where:

1) the environment has mlx5_vdpa only
2) kernel doc can't have endless details, so when developing
application, the author notice IOTLB is cleared during reset

> The Unmap+remap trick around vdpa reset works totally
> fine for platform IOMMU, except with sub-optimal performance. Other than
> this trick, I cannot easily think of other means or iotlb message
> sequence for userspace to recover the bogus state and make iotlb back to
> work again after reset.

Yes for sure, but we can't audit every user space, no?

> Are we talking about hypnosis that has no real
> basis to exist in the real world?

Instead of trying to answer these hard questions, I would go another
way. That is, stick to the old behaviour when IOTLB_PRESISIT is not
set by the backend. This is much easier.

>
> > If we do
> > this without a negotiation, IOTLB will not be clear but the Qemu will
> > try to re-program the IOTLB after reset. Which will break?
> >
> > 1) stick the exact old behaviour with just one line of check
> It's not just one line of check here, the old behavior emulation has to
> be done as Eugenio illustrated in the other email.

For vhost-vDPA it's just

if (IOTLB_PERSIST is acked by userspace)
reset_map()

For parent, it's somehow similar:

during .reset()

if (IOTLB_PERSIST is not acked by userspace)
reset_vendor_mappings()

Anything I missed here?

> In addition, the
> emulation has to limit to those buggy drivers as I don't feel this
> emulation should apply uniformly to all future set_map/dma_map
> implementations.

Unfortunately, it's a must to stick to ABI. I agree it's a mess but we
don't have a better choice. Or we can fail the probe if userspace
doesn't ack this feature.

> > 2) audit all the possible cases to avoid a one line of code
> >
> > 1) seems much easier than 2)
> You see it's more than just one line of code, and I'm uncertain if the
> additional complexity is warranted or necessary, particularly if added
> this piece of compatibility code will linger for quite a long time.

This is a must as long as it can be noticed by userspace. Doing
something conservative makes more sense to me.

> Instead of adding hypothetical code change for no specific good reason
> and no real use case,

It's not adding something new or new behaviours, it's just making the
IOTLB reset conditional based on vDPA reset.

> I'd like to add the code when we find out a
> specific use case that may get impacted or already being affected,

It doesn't conflict with what you proposed here. Old behaviours have
their users, no?

> then
> we will have good understanding how to code up the fix and emulate
> properly for compatibility, while not affecting other good implementations.

The issue is, even if we can't find a userspace now. It doesn't mean
we can't have one in the future. Then it might be too late or too
tricky to fix them. We had a lot of lessons in the past.

Thanks

>
> Thanks,
> -Siwe/i/
>
> >
> >>>> For two reasons:
> >>>>
> >>>> 1) backend features need acked by userspace this is by design
> >>>> 2) keep the odd behaviour seems to be more safe as we can't audit
> >>>> every userspace program
> >>>>
> >>> The old behavior (without flag ack) cannot be trusted already, as:
> > Possibly but the point is to unbreak userspace no matter how weird the
> > behaviour we've ever had.
> >
> >>> * Devices using platform IOMMU (in other words, not implementing
> >>> neither .set_map nor .dma_map) does not unmap memory at virtio reset.
> >>> * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
> >>> reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
> >>> called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
> >>> now all backends work the same as far as I know., which was (and is)
> >>> the way devices using the platform IOMMU works.
> >>>
> >>> The difference in behavior did not matter as QEMU unmaps all the
> >>> memory unregistering the memory listener at vhost_vdpa_dev_start(...,
> >>> started = false),
> >> Exactly. It's not just QEMU, but any (older) userspace manipulates
> >> mappings through the vhost-vdpa iotlb interface has to unmap all
> >> mappings to workaround the vdpa parent driver bug.
> > Just to clarify, from userspace, it's the (odd) behaviour of the current uAPI.
> >
> >> If they don't do
> >> explicit unmap, it would cause state inconsistency between vhost-vdpa
> >> and parent driver, then old mappings can't be restored, and new mapping
> >> can be added to iotlb after vDPA reset. There's no point to preserve
> >> this broken and inconsistent behavior between vhost-vdpa and parent
> >> driver, as userspace doesn't care at all!
> > It's a userspace notice change so we can't fix it silently:
> >
> > https://lkml.org/lkml/2012/12/23/75
> >
> > Another example which is related to vhost-vDPA:
> >
> > https://lore.kernel.org/netdev/20230927140544.205088-1-eric.auger@xxxxxxxxxx/T/
> >
> > Thanks
> >
> >>> but the backend acknowledging this feature flag
> >>> allows QEMU to make sure it is safe to skip this unmap & map in the
> >>> case of vhost stop & start cycle.
> >>>
> >>> In that sense, this feature flag is actually a signal for userspace to
> >>> know that the bug has been solved.
> >> Right, I couldn't say it better than you do, thanks! The feature flag is
> >> more of an unusual means to indicating kernel bug having been fixed,
> >> rather than introduce a new feature or new kernel behavior ending up in
> >> change of userspace's expectation.
> >>
> >>> Not offering it indicates that
> >>> userspace cannot trust the kernel will retain the maps.
> >>>
> >>> Si-Wei or Dragos, please correct me if I've missed something. Feel
> >>> free to use the text in case you find more clear in doc or patch log.
> >> Sure, will do, thank you! Will post v2 adding these to the log.
> >>
> >> Thanks,
> >> -Siwei
> >>
> >>
> >>
> >>> Thanks!
> >>>
> >>>> Thanks
> >>>>
> >>>>> I think
> >>>>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
> >>>>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
> >>>>>
> >>>>> Thanks,
> >>>>> -Siwei
> >>>>>
> >>>>> [1]
> >>>>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@xxxxxxxxxx/
> >>>>> [2]
> >>>>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@xxxxxxxxxx/
> >>>>>> Otherwise
> >>>>>> we may break old userspace.
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>>> + vhost_vdpa_reset_map(v, asid);
> >>>>>>> kfree(as);
> >>>>>>>
> >>>>>>> return 0;
> >>>>>>> --
> >>>>>>> 1.8.3.1
> >>>>>>>
>