Re: [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range

From: Michael Roth
Date: Tue Jun 20 2023 - 12:29:11 EST


On Thu, Jun 15, 2023 at 01:12:18PM -0700, isaku.yamahata@xxxxxxxxx wrote:
> From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
>
> TDX and SEV-SNP need to know the reason for a callback by
> kvm_unmap_gfn_range(). mmu notifier, set memory attributes ioctl or KVM
> gmem callback. The callback handler changes the behavior or does the
> additional housekeeping operation. For mmu notifier, it's zapping shared
> PTE. For set memory attributes, it's the conversion of memory attributes
> (private <=> shared). For KVM gmem, it's punching a hole in the range, and
> releasing the file.

I think it's still an open topic that we need to hear more from Sean about:

https://lore.kernel.org/lkml/20230522235838.ov3722lcusotzlvo@xxxxxxx/

but I *think* we were leaning toward decoupling the act of invalidating
GFNs, vs. the act of invalidating/free'ing gmem pages.

One concrete example of where this seperation makes sense if with
hole-punching. SNP has unique platform-specific stuff it has to do before
free'ing that gmem page back to the host. If we try to plumb this through
kvm_unmap_gfn_range() via a special flag then it's a little awkward
because:

a) Presumably that hole-punch would have occurred after a preceeding
KVM_SET_MEMORY_ATTRIBUTES was issued to switch the page to shared
state in the xarray. So all it should really need to do is handle
that platform-specific behavior, like updating RMP table in case of
SNP. But none of the other details like GFN ranges are relevant in
that case, RMP updates here only need the PFN, so we end up walking
memslots to do GFN->PFN translations, when it would actually be much
more efficient to do these translations by translating the
hole-punched FD offset range to the corresponding folio()'s backing
those ranges

b) It places an unecessary dependency on having an active memslot to do
those translations. This ends up not being an issue with current
version of gmem patchset because the release() happens *before*
gmem_unbind(), so there is a memslot associated with the ranges at
gmem_release() time, but in the initial version of gmem it was the
reverse, so if things ever changed again in this regard we'd once
again have to completely rework how to issue these platform-specific
invalidation callbacks.

I really *really* like having a separate, simple invalidation mechanism
in place that just converts FD offsets to PFNs and then passes those on
to platform-defined handlers to clean up pages before free'ing them back
to the system. It's versatile in that it can be called pretty much
anywhere regardless of where we are in KVM lifecycle, it's robust in
that it doesn't rely on unecessary outside dependencies, and it avoids
added uneeded complexity to paths like kvm_unmap_gfn_range().

That's the approach taken with SNP hypervisor v9 series, with the
gmem hook being introduced here:

https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@xxxxxxx/T/#m3ad8245235a27ed0f41c359c191dcda6c77af043

and the SEV-SNP implementation of that hook being here:

https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@xxxxxxx/T/#m6ac04b44722dbc07839011816e94fadf5ad6794e

Would a similar approach work for TDX? At least WRT cleaning up pages
before returning them back to the host? If we could isolate that
requirement/handling from all the other aspects of invalidations it
really seems like it would cause us less headaches down the road.

>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> ---
> include/linux/kvm_host.h | 11 ++++++++++-
> virt/kvm/guest_mem.c | 10 +++++++---
> virt/kvm/kvm_main.c | 4 +++-
> 3 files changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1a47cedae8a1..c049c0aa44d6 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -256,12 +256,21 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> #endif
>
> #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> +
> +#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR BIT(0)

Can you go into more detail on why special handling is needed for
SET_MEM_ATTR?

> +#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE BIT(1)
> +#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE BIT(2)

Would the need to distinguish between PUNCH_HOLE/RELEASE go away in the
TDX case if you take the above approach? For SNP, the answer is yes. If
that's also the case for TDX I think that would be another argument in
favor of decoupling these from existing KVM MMU invalidation path.

-Mike

> +
> struct kvm_gfn_range {
> struct kvm_memory_slot *slot;
> gfn_t start;
> gfn_t end;
> - pte_t pte;
> + union {
> + pte_t pte;
> + u64 attrs;
> + };
> bool may_block;
> + unsigned int flags;
> };
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> index cdf2d84683c8..30b8f66784d4 100644
> --- a/virt/kvm/guest_mem.c
> +++ b/virt/kvm/guest_mem.c
> @@ -99,7 +99,8 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
> }
>
> static void kvm_gmem_invalidate_begin(struct kvm *kvm, struct kvm_gmem *gmem,
> - pgoff_t start, pgoff_t end)
> + pgoff_t start, pgoff_t end,
> + unsigned int flags)
> {
> struct kvm_memory_slot *slot;
> unsigned long index;
> @@ -118,6 +119,7 @@ static void kvm_gmem_invalidate_begin(struct kvm *kvm, struct kvm_gmem *gmem,
> .slot = slot,
> .pte = __pte(0),
> .may_block = true,
> + .flags = flags,
> };
>
> kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);
> @@ -156,7 +158,8 @@ static long kvm_gmem_punch_hole(struct file *file, loff_t offset, loff_t len)
> */
> filemap_invalidate_lock(file->f_mapping);
>
> - kvm_gmem_invalidate_begin(kvm, gmem, start, end);
> + kvm_gmem_invalidate_begin(kvm, gmem, start, end,
> + KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE);
>
> truncate_inode_pages_range(file->f_mapping, offset, offset + len - 1);
>
> @@ -263,7 +266,8 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
> * Free the backing memory, and more importantly, zap all SPTEs that
> * pointed at this file.
> */
> - kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul);
> + kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul,
> + KVM_GFN_RANGE_FLAGS_GMEM_RELEASE);
> truncate_inode_pages_final(file->f_mapping);
> kvm_gmem_invalidate_end(kvm, gmem, 0, -1ul);
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 422d49634c56..9cdfa2fb675f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -613,6 +613,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
> gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
> gfn_range.slot = slot;
> + gfn_range.flags = 0;
>
> if (!locked) {
> locked = true;
> @@ -2391,8 +2392,9 @@ static void kvm_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
> bool flush = false;
> int i;
>
> - gfn_range.pte = __pte(0);
> + gfn_range.attrs = attrs;
> gfn_range.may_block = true;
> + gfn_range.flags = KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR;
>
> for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> slots = __kvm_memslots(kvm, i);
> --
> 2.25.1
>