Re: [PATCH v3 1/6] kvm: determine memory type from VMA

From: Jason Gunthorpe
Date: Wed Jun 14 2023 - 08:44:17 EST

Next message: Dan Williams: "RE: [PATCH v2 1/2] x86/numa: Introduce numa_fill_memblks()"
Previous message: Richard W.M. Jones: "Re: printk.time causes rare kernel boot hangs"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, May 31, 2023 at 12:35:57PM +0100, Catalin Marinas wrote:

> Current status on arm64: vfio-pci user mapping (Qemu) of PCIe BARs is
> Device_nGnRnE. KVM also maps it at Stage 2 with the same attributes. The
> user mapping is fine since the VMM may not have detailed knowledge about
> how to safely map the BAR. KVM doesn't have such detailed knowledge
> either in the device pass-through case but for safety reasons it follows
> the same attributes as the user mapping.

To put a fine point on this - it means the KVM VM does not emulate the
same behaviors as a bare metal ARM system and this results in
significant performance degradation (loss of WC) or malfunction (loss
of cachable) when working with some VFIO device scenarios.

> architecture). Presumably cloud vendors allowing device pass-through
> already know their system well enough, no need for further policing in
> KVM (which it can't do properly anyway).

This also matches x86 pretty well. There is alot of platform and BIOS
magic at work to make sure that VFIO is safe in general. As this is
not discoverable by the OS we have no choice in the kernel but to
assume the operator knows what they are doing.

> While I think it's less likely for the VMM to access the same BAR that
> was mapped into the guest, if we want to be on the safe side from an ABI
> perspective (the returned mmap() now has different memory guarantees),
> we could make this an opt-in. That's pretty much the VMM stating that it
> is ok with losing some of the Device guarantees for the vfio-pci
> mapping.

I don't know of any use case where this matters. The VMM is already
basically not allowed to touch the MMIO spaces for security
reasons. The only reason it is mmap'd into the VMM at all is because
this is the only way to get it into KVM.

So an opt-in seems like overkill to me.

> Going back to this series, allowing Cacheable mapping in KVM has similar
> implications as above. (2) and (3) would be assumed safe by the platform
> vendors. Regarding (1), to avoid confusion, I would only allow it if FWB
> (force write-back) is present so that KVM can enforce a cacheable
> mapping at Stage 2 if the vfio variant driver also maps it as cacheable
> in user space.

Yes, currently if FWB is not available this patch makes KVM crash :\
FWB needs to be enforced as well in this patch.

> There were some other thoughts on only allowing different attributes in
> KVM if the user counterpart does not have any mapping (e.g. fd-based
> KVM memory slots). However, this won't work well with CXL-attached
> memory for example where the VMM may want access to it (e.g. virtio). So
> I wouldn't spend more thoughts on this.

CXL is a different beast - struct page backed CXL memory should be
force-cachable everywhere just like normal system memory (it is normal
system memory). This is the kind of memory that might be mixed with
virtio.

However, some future CXL VFIO device memory should ideally allow the
VM full flexibility to use cachable or not to properly emulate
bare-metal. This approach where we pre-select some of the CXL memory
as cachable is kind of a half job..

I think providing a way for KVM to take in a FD instead of a VMA for
mapping memory could be interesting in general. We waste a fair amount
of PTEs creating a 40GB linear map in the CPU just for KVM. Avoiding
the alias mapping is a nice side effect.

We'd have to do something else for the SIGBUS stuff though..

> a) Relax the KVM Stage 2 mapping for vfio-pci to Normal NC
> (pgprot_writecombine). TBD whether we need a VMM opt-in is at the

We are working on a tested patch for this, it isn't working yet for
some reason, still debugging.

> b) Map the KVM stage 2 mapping as cacheable+FWB iff the VMM has a
> corresponding cacheable mapping.

That is this patch, with some improvements to check FWB/etc.

Thanks,
Jason

Next message: Dan Williams: "RE: [PATCH v2 1/2] x86/numa: Introduce numa_fill_memblks()"
Previous message: Richard W.M. Jones: "Re: printk.time causes rare kernel boot hangs"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]