Re: [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory

From: Steven Price
Date: Fri Feb 09 2024 - 10:02:36 EST

Hi Sean,

Thanks for the reply.

On 09/02/2024 14:28, Sean Christopherson wrote:
> On Fri, Feb 09, 2024, Steven Price wrote:
>> On 16/10/2023 12:50, Michael Roth wrote:
>>> In some cases, like with SEV-SNP, guest memory needs to be updated in a
>>> platform-specific manner before it can be safely freed back to the host.
>>> Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
>>> allow for special handling of this sort when freeing memory in response
>>> to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
>>> ahead and define an arch-specific hook for x86 since it will be needed
>>> for handling memory used for SEV-SNP guests.
>> Hi all,
>> Arm CCA has a similar need to prepare/unprepare memory (granule
>> delegate/undelegate using our terminology) before it is used for
>> protected memory.
>> However I see a problem with the current gmem implementation that the
>> "invalidations" are not precise enough for our RMI API. When punching a
>> hole in the memfd the code currently hits the same path (ending in
>> kvm_unmap_gfn_range()) as if a VMA is modified in the same range (for
>> the shared version).
>> The Arm CCA architecture doesn't allow the protected memory to be removed and
>> refaulted without the permission of the guest (the memory contents would be
>> wiped in this case).
> TDX behaves almost exactly like CCA. Well, that's not technically true, strictly
> speaking, as there are TDX APIs that do allow for *temporarily* marking mappings
> !PRESENT, but those aren't in play for invalidation events like this.

Ok, great I was under the impression they were similar.

> SNP does allow zapping page table mappings, but fully removing a page, as PUNCH_HOLE
> would do, is destructive, so SNP also behaves the same way for all intents and
> purposes.

Zapping page table mappings is what the invalidate calls imply. This is
something CCA can't do. Obviously fully removing the page would be

>> One option that I've considered is to implement a seperate CCA ioctl to
>> notify KVM whether the memory should be mapped protected.

Sorry, I really didn't explain that well. Yes effectively this is the
attribute flag, but there's corner cases for destruction of the VM. My
thought was that if the VMM wanted to tear down part of the protected
range (without making it shared) then a separate ioctl would be needed
to notify KVM of the unmap.

>> The invalidations would then be ignored on ranges that are currently
>> protected for this guest.
> That's backwards. Invalidations on a guest_memfd should affect only *protected*
> mappings. And for that, the plan/proposal is to plumb only_{shared,private} flags
> into "struct kvm_gfn_range"[1] so that guest_memfd invalidations don't zap shared
> mappings, and mmu_notifier invalidation don't zap private mappings. Sample usage
> in the TDX context[2] (disclaimer, I'm pretty sure I didn't write most of that
> patch despite, I only provided a rough sketch).

Aha, this sounds much like my option 3 below - a way to tell if the
invalidate comes from guest_memfd as opposed to VMA changes.

> [1]
> [2]
>> This 'solves' the problem nicely except for the case where the VMM
>> deliberately punches holes in memory which the guest is using.
> I don't see what problem there is to solve in this case. PUNCH_HOLE is destructive,
> so don't do that.

A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
my concern here is a VMM which is trying to break the host. In this case
either the PUNCH_HOLE needs to fail, or we actually need to recover the
memory from the guest (effectively killing the guest in the process).

>> The issue in this case is that there's no way of failing the punch hole
>> operation - we can detect that the memory is in use and shouldn't be
>> freed, but this callback doesn't give the opportunity to actually block
>> the freeing of the memory.
> Why is this KVM's problem? E.g. the same exact thing happens without guest_memfd
> if userspace munmap()s memory the guest is using.

Indeed. The difference here is that for a normal non-realm guest the
pages can be removed from the page-table and refaulted on a later
access. Indeed there's nothing stopping the VMM from using freeing the
pages and reallocating them later.

For a realm guest if the memory is pulled from the guest then the guest
is effectively dead (at least until migration is implemented but even
then there's going to be a specific controlled mechanism).

>> Sadly there's no easy way to map from a physical page in a gmem back to
>> which VM (and where in the VM) the page is mapped. So actually ripping
>> the page out of the appropriate VM isn't really possible in this case.
> I don't follow. guest_memfd has a 1:1 binding with a VM *and* a gfn, how can you
> not know what exactly needs to be invalidated?

At the point that gmem calls kvm_mmu_unmap_gfn_range() the fact that the
range is a gmem is lost.

>> How is this situation handled on x86? Is it possible to invalidate and
>> then refault a protected page without affecting the memory contents? My
>> guess is yes and that is a CCA specific problem - is my understanding
>> correct?
>> My current thoughts for CCA are one of three options:
>> 1. Represent shared and protected memory as two separate memslots. This
>> matches the underlying architecture more closely (the top address bit is
>> repurposed as a 'shared' flag), but I don't like it because it's a
>> deviation from other CoCo architectures (notably pKVM).
>> 2. Allow punch-hole to fail on CCA if the memory is mapped into the
>> guest's protected space. Again, this is CCA being different and also
>> creates nasty corner cases where the gmem descriptor could have to
>> outlive the VMM - so looks like a potential source of memory leaks.
>> 3. 'Fix' the invalidation to provide more precise semantics. I haven't
>> yet prototyped it but it might be possible to simply provide a flag from
>> kvm_gmem_invalidate_begin specifying that the invalidation is for the
>> protected memory. KVM would then only unmap the protected memory when
>> this flag is set (avoiding issues with VMA updates causing spurious unmaps).
>> Fairly obviously (3) is my preferred option, but it relies on the
>> guarantees that the "invalidation" is actually a precise set of
>> addresses where the memory is actually being freed.
> #3 is what we are planning for x86, and except for the only_{shared,private} flags,
> the requisite functionality should already be in Linus' tree, though it does need
> to be wired up for ARM.

Thanks, looks like the only_{shared,private} flags should do it. My only
worry about that solution was that it implicitly changes the
"invalidation" when only_private==1 to a precise list of pages that are
to be unmapped. Whereas for a normal guest it's only a performance issue
if a larger region is invalidated, for a CoCo guest it would be fatal to
the guest.

I'll cherry-pick the "KVM: Add new members to struct kvm_gfn_range to
operate on" patch from the TDX tree as I think this should do the trick.
I have hacked up something similar and it looks like it should work.