Re: [PATCH v3 2/2] kvm: arm64: set io memory s2 pte as normalnc for vfio pci devices

From: Lorenzo Pieralisi
Date: Thu Dec 14 2023 - 10:48:30 EST


[+James]

On Wed, Dec 13, 2023 at 08:05:29PM +0000, Oliver Upton wrote:
> Hi,
>
> Sorry, a bit late to the discussion :)
>
> On Tue, Dec 12, 2023 at 02:11:56PM -0400, Jason Gunthorpe wrote:
> > On Tue, Dec 12, 2023 at 05:46:34PM +0000, Catalin Marinas wrote:
> > > should know the implications. There's also an expectation that the
> > > actual driver (KVM guests) or maybe later DPDK can choose the safe
> > > non-cacheable or write-combine (Linux terminology) attributes for the
> > > BAR.
> >
> > DPDK won't rely on this interface
>
> Wait, so what's the expected interface for determining the memory
> attributes at stage-1? I'm somewhat concerned that we're conflating two
> things here:
>
> 1) KVM needs to know the memory attributes to use at stage-2, which
> isn't fundamentally different from what's needed for userspace
> stage-1 mappings.
>
> 2) KVM additionally needs a hint that the device / VFIO can handle
> mismatched aliases w/o the machine exploding. This goes beyond
> supporting Normal-NC mappings at stage-2 and is really a bug
> with our current scheme (nGnRnE at stage-1, nGnRE at stage-2).
>
> I was hoping that (1) could be some 'common' plumbing for both userspace
> and KVM mappings. And for (2), any case where a device is intolerant of
> mismatches && KVM cannot force the memory attributes should be rejected.
>
> AFAICT, the only reason PCI devices can get the blanket treatment of
> Normal-NC at stage-2 is because userspace has a Device-* mapping and can't
> speculatively load from the alias. This feels a bit hacky, and maybe we
> should prioritize an interface for mapping a device into a VM w/o a
> valid userspace mapping.

FWIW - I have tried to summarize the reasoning behind the safety of a
Normal-NC default at stage-2 for PCIe devices in a document that, I have
just realized, has now become this series' cover letter. I don't think the
PCI blanket treatment is related *only* to the current user space mappings
(BTW, AFAICS it is also *possible* at present to map a prefetchable BAR
through sysfs with Normal-NC memory attributes in the host at the same time
as a PCI device is passed through to a guest with VFIO - and therefore we
have a dev-nGnRnE stage-1 mapping for it. I don't think anyone does that -
what for? - but it is possible and KVM would not know about it).

Again, FWIW, we were told (source: Arm ARM) that mismatched aliases
concerning device-XXX vs Normal-NC are not problematic as long as the
transactions issued for the related mappings are independent (and none of
the mappings is cacheable).

I appreciate this is not enough to give everyone full confidence in
this solution's robustness - that's why I wrote that up, so that we know
what we are up against and can write KVM interfaces accordingly.

> I very much understand that this has been going on for a while, and we
> need to do *something* to get passthrough working well for devices that
> like 'WC'. I just want to make sure we don't paint ourselves into a corner
> that's hard to get out of in the future.

That makes perfect sense; see above. If there is anything we can do
to clarify, we will, in whatever shape is preferred.

Thanks,
Lorenzo