Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release

From: Jason Wang
Date: Wed Oct 18 2023 - 22:49:40 EST


On Thu, Oct 19, 2023 at 7:21 AM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote:
>
>
>
> On 10/18/2023 4:14 AM, Eugenio Perez Martin wrote:
> > On Wed, Oct 18, 2023 at 10:44 AM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote:
> >>
> >>
> >> On 10/17/2023 10:27 PM, Jason Wang wrote:
> >>> On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote:
> >>>>
> >>>> On 10/16/2023 7:35 PM, Jason Wang wrote:
> >>>>> On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote:
> >>>>>> On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@xxxxxxxxxx> wrote:
> >>>>>>>> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote:
> >>>>>>>>> On 10/12/2023 8:01 PM, Jason Wang wrote:
> >>>>>>>>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote:
> >>>>>>>>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> >>>>>>>>>>> may need to restore iotlb mapping to the initial or default state
> >>>>>>>>>>> using the .reset_map op, as it's desirable for some parent devices
> >>>>>>>>>>> to solely manipulate mappings by its own, independent of virtio device
> >>>>>>>>>>> state. For instance, device reset does not cause mapping go away on
> >>>>>>>>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> >>>>>>>>>>> is going away, give them a chance to reset iotlb back to the initial
> >>>>>>>>>>> state in vhost_vdpa_cleanup().
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@xxxxxxxxxx>
> >>>>>>>>>>> ---
> >>>>>>>>>>> drivers/vhost/vdpa.c | 16 ++++++++++++++++
> >>>>>>>>>>> 1 file changed, 16 insertions(+)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>>>>>>>>>> index 851535f..a3f8160 100644
> >>>>>>>>>>> --- a/drivers/vhost/vdpa.c
> >>>>>>>>>>> +++ b/drivers/vhost/vdpa.c
> >>>>>>>>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
> >>>>>>>>>>> return vhost_vdpa_alloc_as(v, asid);
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> >>>>>>>>>>> +{
> >>>>>>>>>>> + struct vdpa_device *vdpa = v->vdpa;
> >>>>>>>>>>> + const struct vdpa_config_ops *ops = vdpa->config;
> >>>>>>>>>>> +
> >>>>>>>>>>> + if (ops->reset_map)
> >>>>>>>>>>> + ops->reset_map(vdpa, asid);
> >>>>>>>>>>> +}
> >>>>>>>>>>> +
> >>>>>>>>>>> static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>>>>>>>>>> {
> >>>>>>>>>>> struct vhost_vdpa_as *as = asid_to_as(v, asid);
> >>>>>>>>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>>>>>>>>>>
> >>>>>>>>>>> hlist_del(&as->hash_link);
> >>>>>>>>>>> vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> >>>>>>>>>>> + /*
> >>>>>>>>>>> + * Devices with vendor specific IOMMU may need to restore
> >>>>>>>>>>> + * iotlb to the initial or default state which is not done
> >>>>>>>>>>> + * through device reset, as the IOTLB mapping manipulation
> >>>>>>>>>>> + * could be decoupled from the virtio device life cycle.
> >>>>>>>>>>> + */
> >>>>>>>>>> Should we do this according to whether IOTLB_PRESIST is set?
> >>>>>>>>> Well, in theory this seems like so but it's unnecessary code change
> >>>>>>>>> actually, as that is the way how vDPA parent behind platform IOMMU works
> >>>>>>>>> today, and userspace doesn't break as of today. :)
> >>>>>>>> Well, this is one question I've ever asked before. You have explained
> >>>>>>>> that one of the reason that we don't break userspace is that they may
> >>>>>>>> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
> >>>>>>>>
> >>>>>>>>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
> >>>>>>>>> it doesn't necessarily mean the iotlb will definitely be destroyed
> >>>>>>>>> across reset (think about the platform IOMMU case), so userspace today
> >>>>>>>>> is already tolerating enough with either good or bad IOMMU.
> >>>>> I'm confused, how to define tolerating here?
> >>>> Tolerating defined as QEMU has to proactively unmap before reset just to
> >>>> workaround the driver bug (on-chip maps out of sync), unconditionally
> >>>> for platform or on-chip. While we all know it doesn't have to do so for
> >>>> platform IOMMU, though userspace has no means to distinguish. That said,
> >>>> userspace is sacrificing reset time performance on platform IOMMU setup
> >>>> just for working around buggy implementation in the other setup.
> >>> Ok, so what you actually mean is that userspace can tolerate the "bug"
> >>> with the performance penalty.
> >> Right.
> >>>
> >>>>> For example, if it has tolerance, why bother?
> >>>> I'm not sure I get the question. But I think userspace is compromising
> >>>> because of buggy implementation in a few drivers doesn't mean we should
> >>>> uniformly enforce such behavior for all set_map/dma_map implementations.
> >>> This is not my point. I meant, we can fix we need a negotiation in
> >>> order to let some "buggy" old user space to survive from the changes.
> >> Userspace is no buggy today, how to define "buggy"? Userspace with
> >> tolerance could survive just fine no matter if this negotiation or buggy
> >> driver behavior emulation is around or not. If any userspace doesn't
> >> tolerate, it can work still fine on good on-chip IOMMU or platform
> >> IOMMU, no matter if the negotiation is around or not.
> >>>>>>>> This code of
> >>>>>>>>> not checking IOTLB_PERSIST being set is intentional, there's no point to
> >>>>>>>>> emulate bad IOMMU behavior even for older userspace (with improper
> >>>>>>>>> emulation to be done it would result in even worse performance).
> >>>>> I can easily imagine a case:
> >>>>>
> >>>>> The old Qemu that works only with a setup like mlx5_vdpa.
> >>>> Noted, seems to me there's no such case of a userspace implementation
> >>>> that only works with mlx5_vdpa or its friends, but doesn't work with the
> >>>> others e.g. platform IOMMU, or well behaving on-chip IOMMU
> >>>> implementations.
> >>> It's not hard to think of a case where:
> >>>
> >>> 1) the environment has mlx5_vdpa only
> >>> 2) kernel doc can't have endless details, so when developing
> >>> application, the author notice IOTLB is cleared during reset
> >> I get it, but my question was that, even if the author had noticed IOTLB
> >> is cleared during reset, does he care or not to make IOTLB back working
> >> again? My point is that, if this old setup is supposed to "work" on
> >> mlx5_vdpa, then the developer must come up with sort of "quirk" to
> >> recover the IOTLB to make it back to working state again after the
> >> reset. It will be more justified to come up with the proper fix for
> >> compatibility/emulation only until we know what should be expected to
> >> work and through which possible means to making it back to work, rather
> >> than blindly emulate the buggy behavior solely based on a few driver's
> >> own implementation. I'm pretty sure there are multiple ways to implement
> >> the buggy reset behavior in the driver, does it mean we have to emulate
> >> various corrupted mapping states in the individual on-chip iommu itself?
> >> How is it able to help the developer user if we are able to replicate
> >> the same corrupted mapping state in the on-chip iommu after reset, any
> >> real-life user only cares about mapping being corrupted in the same way,
> >> rather than cares more about the quirk sequence or work around to get
> >> iotlb maps out of the broken state?
> >>
> >> Only if the userspace is like a test facility to expect some test case
> >> to fail on mlx5_vdpa after reset -- I assume that is not real-life user
> >> at all.
> >>>> The Unmap+remap trick around vdpa reset works totally
> >>>> fine for platform IOMMU, except with sub-optimal performance. Other than
> >>>> this trick, I cannot easily think of other means or iotlb message
> >>>> sequence for userspace to recover the bogus state and make iotlb back to
> >>>> work again after reset.
> >>> Yes for sure, but we can't audit every user space, no?
> >> We don't have to, as userspace here has no bug at all. The bug exists in
> >> the driver not in userspace. Real life userspace app only cares about
> >> making things work not asserting something must be broken.
> >>>> Are we talking about hypnosis that has no real
> >>>> basis to exist in the real world?
> >>> Instead of trying to answer these hard questions, I would go another
> >>> way. That is, stick to the old behaviour when IOTLB_PRESISIT is not
> >>> set by the backend. This is much easier.
> >> Please be noted the old (broken) behavior can vary between different
> >> parent driver implementations. It's driver's specific own problem, if
> >> there are N ways to for driver to implement buggy .reset, do we have to
> >> emulate N flavors of different vdpa reset behavior?
> >>
> >>>>> If we do
> >>>>> this without a negotiation, IOTLB will not be clear but the Qemu will
> >>>>> try to re-program the IOTLB after reset. Which will break?
> >>>>>
> >>>>> 1) stick the exact old behaviour with just one line of check
> >>>> It's not just one line of check here, the old behavior emulation has to
> >>>> be done as Eugenio illustrated in the other email.
> >>> For vhost-vDPA it's just
> >>>
> >>> if (IOTLB_PERSIST is acked by userspace)
> >>> reset_map()
> >>>
> >>> For parent, it's somehow similar:
> >>>
> >>> during .reset()
> >>>
> >>> if (IOTLB_PERSIST is not acked by userspace)
> >>> reset_vendor_mappings()
> >>>
> >>> Anything I missed here?
> >> First, the ideal fix would be to leave this reset_vendor_mappings()
> >> emulation code on the individual driver itself, which already has the
> >> broken behavior. But today there's no backend feature negotiation
> >> between vhost-vdpa and the parent driver. Do we want to send down the
> >> acked_backend_features to parent drivers?
> >>
> > What if we add a module parameter to both mlx5 and vdpa_sim to keep
> > the old behavior? Let's call it clean_iotlb_on_reset for now.
> >
> > In my opinion we can leave it off by default, so these userspace apps
> > can get back to the previous behavior. It would be ideal if we set a
> > deprecation date for it though.
> >
> > This way new backends, whether they implement .set_map or not, will
> > have correct behavior.
> >
> > Would that work?
> Great idea, this definitely will work! With this module parameter,
> individual driver still keeps the possibility to revert to previous
> buggy behavior were to unbreak old userspace, code can be obsoleted
> independently per each driver's specific use case and need, and we don't
> necessarily overload vdpa core with too much unwarranted compatibility
> code. Thank you so much for the great suggestion,

I disagree, module parameters have been proved to be very hard for
management. And what I don't understand here is, once a module
parameter can work, why not just replace it with the checking of
IOTLB_PRESIST?

Thanks


> I will post a v3.
>
> Thanks,
> -Siwei
>
> >
> > Thanks!
> >
> >> Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of
> >> backend feature negotiation in parent driver, if vhost-vdpa has to
> >> provide the old-behaviour emulation for compatibility on driver's
> >> behalf, it needs to be done per-driver basis. There could be good
> >> on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in
> >> .reset, and vendor specific IOMMU doesn't have to provide .reset_map, we
> >> should allow these good driver implementations rather than
> >> unconditionally stick to some specific problematic behavior for every
> >> other good driver. Then we need a set of device flags (backend_features
> >> bit again?) to indicate the specific driver needs upper layer's help on
> >> old-behaviour emulation.
> >>
> >> Last but not least, I'm not sure how to properly emulate
> >> reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no
> >> .reset_map op implemented, or if .reset_map has a slightly different
> >> implementation than what it used to reset the iotlb in the .reset op,
> >> then this either becomes effectively dead code if no one ends up using,
> >> or the vhost-vdpa emulation is helpless and limited in scope, unable to
> >> cover all the cases.
> >>
> >>
> >>>> In addition, the
> >>>> emulation has to limit to those buggy drivers as I don't feel this
> >>>> emulation should apply uniformly to all future set_map/dma_map
> >>>> implementations.
> >>> Unfortunately, it's a must to stick to ABI.
> >> How come this brokenness in mlx5_vdpa becomes ABI in any sort for future
> >> on-chip IOMMU drivers? They might not even exist yet. Even if it's
> >> concerning ABI it's limited to mlx5_vdpa and the existing drivers, right?
> >>
> >>> I agree it's a mess but we don't have a better choice.
> >> Well, it's your call, I can implement as you wish but the unwarranted
> >> code has to be maintained forever. Particularly without knowing if
> >> there's really such a use case in real life, and no one in future might
> >> dare to remove the code without knowing what it can be used for.
> >>
> >>> Or we can fail the probe if userspace
> >>> doesn't ack this feature.
> >> Fail probing is even worse choice that is introducing intrusive breakage
> >> to the userspace.
> >>>>> 2) audit all the possible cases to avoid a one line of code
> >>>>>
> >>>>> 1) seems much easier than 2)
> >>>> You see it's more than just one line of code, and I'm uncertain if the
> >>>> additional complexity is warranted or necessary, particularly if added
> >>>> this piece of compatibility code will linger for quite a long time.
> >>> This is a must as long as it can be noticed by userspace. Doing
> >>> something conservative makes more sense to me.
> >>>
> >>>> Instead of adding hypothetical code change for no specific good reason
> >>>> and no real use case,
> >>> It's not adding something new or new behaviours, it's just making the
> >>> IOTLB reset conditional based on vDPA reset.
> >>>
> >>>> I'd like to add the code when we find out a
> >>>> specific use case that may get impacted or already being affected,
> >>> It doesn't conflict with what you proposed here. Old behaviours have
> >>> their users, no?
> >> We don't know the use case how to make thing work instead of make thing
> >> break, that is the problem. We have no way to test if old-behaviour
> >> preserving code really works as expected. If there's no such user in
> >> practice, it ends up with dead code no one dares to remove.
> >>>> then
> >>>> we will have good understanding how to code up the fix and emulate
> >>>> properly for compatibility, while not affecting other good implementations.
> >>> The issue is, even if we can't find a userspace now. It doesn't mean
> >>> we can't have one in the future. Then it might be too late or too
> >>> tricky to fix them. We had a lot of lessons in the past.
> >> I am not sure the same situation "too late to fix" or "too tricky to
> >> fix" applies here. Usually this means there's some well established
> >> pattern for e.g. API, ABI or long standing de-factor behavior that can't
> >> be broken or adjust if trying to fix something up. But here we're
> >> guarded by a flag (IOTLB_PERSIST) and without it the behavior is totally
> >> ruled by implementation.
> >>
> >> Regards,
> >> -Siwei
> >>
> >>> Thanks
> >>>
> >>>> Thanks,
> >>>> -Siwe/i/
> >>>>
> >>>>>>>> For two reasons:
> >>>>>>>>
> >>>>>>>> 1) backend features need acked by userspace this is by design
> >>>>>>>> 2) keep the odd behaviour seems to be more safe as we can't audit
> >>>>>>>> every userspace program
> >>>>>>>>
> >>>>>>> The old behavior (without flag ack) cannot be trusted already, as:
> >>>>> Possibly but the point is to unbreak userspace no matter how weird the
> >>>>> behaviour we've ever had.
> >>>>>
> >>>>>>> * Devices using platform IOMMU (in other words, not implementing
> >>>>>>> neither .set_map nor .dma_map) does not unmap memory at virtio reset.
> >>>>>>> * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
> >>>>>>> reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
> >>>>>>> called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
> >>>>>>> now all backends work the same as far as I know., which was (and is)
> >>>>>>> the way devices using the platform IOMMU works.
> >>>>>>>
> >>>>>>> The difference in behavior did not matter as QEMU unmaps all the
> >>>>>>> memory unregistering the memory listener at vhost_vdpa_dev_start(...,
> >>>>>>> started = false),
> >>>>>> Exactly. It's not just QEMU, but any (older) userspace manipulates
> >>>>>> mappings through the vhost-vdpa iotlb interface has to unmap all
> >>>>>> mappings to workaround the vdpa parent driver bug.
> >>>>> Just to clarify, from userspace, it's the (odd) behaviour of the current uAPI.
> >>>>>
> >>>>>> If they don't do
> >>>>>> explicit unmap, it would cause state inconsistency between vhost-vdpa
> >>>>>> and parent driver, then old mappings can't be restored, and new mapping
> >>>>>> can be added to iotlb after vDPA reset. There's no point to preserve
> >>>>>> this broken and inconsistent behavior between vhost-vdpa and parent
> >>>>>> driver, as userspace doesn't care at all!
> >>>>> It's a userspace notice change so we can't fix it silently:
> >>>>>
> >>>>> https://lkml.org/lkml/2012/12/23/75
> >>>>>
> >>>>> Another example which is related to vhost-vDPA:
> >>>>>
> >>>>> https://lore.kernel.org/netdev/20230927140544.205088-1-eric.auger@xxxxxxxxxx/T/
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>>>> but the backend acknowledging this feature flag
> >>>>>>> allows QEMU to make sure it is safe to skip this unmap & map in the
> >>>>>>> case of vhost stop & start cycle.
> >>>>>>>
> >>>>>>> In that sense, this feature flag is actually a signal for userspace to
> >>>>>>> know that the bug has been solved.
> >>>>>> Right, I couldn't say it better than you do, thanks! The feature flag is
> >>>>>> more of an unusual means to indicating kernel bug having been fixed,
> >>>>>> rather than introduce a new feature or new kernel behavior ending up in
> >>>>>> change of userspace's expectation.
> >>>>>>
> >>>>>>> Not offering it indicates that
> >>>>>>> userspace cannot trust the kernel will retain the maps.
> >>>>>>>
> >>>>>>> Si-Wei or Dragos, please correct me if I've missed something. Feel
> >>>>>>> free to use the text in case you find more clear in doc or patch log.
> >>>>>> Sure, will do, thank you! Will post v2 adding these to the log.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>>> I think
> >>>>>>>>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
> >>>>>>>>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> -Siwei
> >>>>>>>>>
> >>>>>>>>> [1]
> >>>>>>>>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@xxxxxxxxxx/
> >>>>>>>>> [2]
> >>>>>>>>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@xxxxxxxxxx/
> >>>>>>>>>> Otherwise
> >>>>>>>>>> we may break old userspace.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>>> + vhost_vdpa_reset_map(v, asid);
> >>>>>>>>>>> kfree(as);
> >>>>>>>>>>>
> >>>>>>>>>>> return 0;
> >>>>>>>>>>> --
> >>>>>>>>>>> 1.8.3.1
> >>>>>>>>>>>
>