Re: [PATCH v3 2/2] kvm: arm64: set io memory s2 pte as normalnc for vfio pci devices

From: Jason Gunthorpe
Date: Tue Jan 02 2024 - 12:26:44 EST


On Wed, Dec 13, 2023 at 08:05:29PM +0000, Oliver Upton wrote:
> Hi,
>
> Sorry, a bit late to the discussion :)
>
> On Tue, Dec 12, 2023 at 02:11:56PM -0400, Jason Gunthorpe wrote:
> > On Tue, Dec 12, 2023 at 05:46:34PM +0000, Catalin Marinas wrote:
> > > should know the implications. There's also an expectation that the
> > > actual driver (KVM guests) or maybe later DPDK can choose the safe
> > > non-cacheable or write-combine (Linux terminology) attributes for the
> > > BAR.
> >
> > DPDK won't rely on this interface
>
> Wait, so what's the expected interface for determining the memory
> attributes at stage-1? I'm somewhat concerned that we're conflating two
> things here:

Someday we will have a VFIO ioctl interface to request that individual
pages within a BAR be mmap'd with pgprot_writecombine(). Only
something like DPDK would call this ioctl; a VMM would not use it.

> 1) KVM needs to know the memory attributes to use at stage-2, which
> isn't fundamentally different from what's needed for userspace
> stage-1 mappings.
>
> 2) KVM additionally needs a hint that the device / VFIO can handle
> mismatched aliases w/o the machine exploding. This goes beyond
> supporting Normal-NC mappings at stage-2 and is really a bug
> with our current scheme (nGnRnE at stage-1, nGnRE at stage-2).

Not at all.

This whole issue comes from a fear that some HW will experience an
uncontained failure if NORMAL_NC is used for access to MMIO memory.
Marc pointed at some of the GIC registers as a possible concrete
example of this (though nobody has come up with a concrete example in
the VFIO space).

When KVM sets the S2 memory types it is primarily deciding which
memory types the VM is *NOT* permitted to use. That decision is
fundamentally based on what kind of physical device is behind that
memory and whether the VMM is able to manage the cache.

I.e. the purpose of the S2 memory types is to restrict allowed VM
memory types to protect the integrity of the machine and hypervisor
from the VM.
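
As a rough illustration of "restrict", here is a toy model of how the
S1 and S2 attributes combine when FEAT_S2FWB is not in use: the access
ends up with the more restrictive of the two types. This is a
simplified sketch, not kernel code. The enum and function names are
made up for the example, and it collapses inner/outer cacheability
into a single value.

```c
#include <assert.h>

/*
 * Toy model only. Values are ordered from most to least restrictive,
 * which is what lets combine_s1_s2() be a simple minimum.
 */
enum memtype {
	DEV_nGnRnE,
	DEV_nGnRE,
	DEV_nGRE,
	DEV_GRE,
	NORMAL_NC,
	NORMAL_WT,
	NORMAL_WB,
	MEMTYPE_COUNT
};

/*
 * Effective type of a guest access: the more restrictive of what the
 * guest asked for at stage 1 and what the hypervisor set at stage 2.
 */
static enum memtype combine_s1_s2(enum memtype s1, enum memtype s2)
{
	assert(s1 < MEMTYPE_COUNT && s2 < MEMTYPE_COUNT);
	return s1 < s2 ? s1 : s2;
}
```

With S2 set to DEV_nGnRE, a guest S1 request for NORMAL_NC still
results in DEV_nGnRE. With S2 set to NORMAL_NC, the guest can get
NORMAL_NC or any Device type, but nothing cacheable.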

Thus we have what this series does. In most cases KVM will continue to
do as it does today and restrict MMIO memory to Device_*. We have a
new kind of VMA flag that says this physical memory is safe with both
Device_* and Normal_NC, which causes KVM to stop blocking VM use of
those memory types.
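
The decision described above could be sketched like this. It is only
a model of the logic, not the code in the patch. The flag name and the
enum are hypothetical stand-ins, and the real flag lives in the VMA
that the memory provider (e.g. vfio-pci) sets up.

```c
#include <assert.h>

/* Hypothetical flag: the memory provider vouches that this MMIO
 * range is safe to access with Normal_NC as well as Device_*. */
#define EX_VMA_SAFE_NORMAL_NC	(1u << 0)

enum ex_s2_type {
	EX_S2_DEVICE,		/* guest is limited to Device_* */
	EX_S2_NORMAL_NC,	/* guest may use Device_* or Normal_NC */
};

/*
 * Pick the stage-2 type for an MMIO mapping. Without the provider's
 * opt-in, keep today's behaviour and restrict the guest to Device_*.
 */
static enum ex_s2_type ex_s2_type_for_mmio(unsigned int vma_flags)
{
	if (vma_flags & EX_VMA_SAFE_NORMAL_NC)
		return EX_S2_NORMAL_NC;
	return EX_S2_DEVICE;
}
```

The default stays the conservative one. Only a kernel component that
knows the physical device can set the flag, so userspace cannot talk
KVM into relaxing the restriction.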

> I was hoping that (1) could be some 'common' plumbing for both userspace
> and KVM mappings. And for (2), any case where a device is intolerant of
> mismatches && KVM cannot force the memory attributes should be rejected.

It has nothing to do with mismatches. Catalin explained this in his
other email.

> AFAICT, the only reason PCI devices can get the blanket treatment of
> Normal-NC at stage-2 is because userspace has a Device-* mapping and can't
> speculatively load from the alias. This feels a bit hacky, and maybe we
> should prioritize an interface for mapping a device into a VM w/o a
> valid userspace mapping.

Userspace has a Device-* mapping, yes. That is because userspace can't
know anything better.

> I very much understand that this has been going on for a while, and we
> need to do *something* to get passthrough working well for devices that
> like 'WC'. I just want to make sure we don't paint ourselves into a corner
> that's hard to get out of in the future.

Fundamentally KVM needs to understand the restrictions of the
underlying physical MMIO, and this has to be a secure indication from
the kernel component supplying the memory to KVM, which consumes it.
Here we are using a VMA flag, but any other behind-the-scenes scheme
would work in the future.

Jason