Re: [PATCH 4/8] KVM: gmem: protect kvm_mmu_invalidate_end()

From: Kalra, Ashish
Date: Mon Aug 21 2023 - 17:44:57 EST


Hello Mingwei & Sean,

On 8/18/2023 9:08 PM, Mingwei Zhang wrote:
+Jacky Li

On Fri, Aug 18, 2023 at 3:45 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:

+Mingwei to correct me if I'm wrong

On Fri, Aug 18, 2023, Ashish Kalra wrote:

On 8/18/2023 12:55 PM, Sean Christopherson wrote:
On Tue, Aug 15, 2023, isaku.yamahata@xxxxxxxxx wrote:
From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>

kvm_mmu_invalidate_end() updates struct kvm::mmu_invalidate_in_progress
and it's protected by kvm::mmu_lock. call kvm_mmu_invalidate_end() before
unlocking it. Not after the unlock.

Fixes: 8e9009ca6d14 ("KVM: Introduce per-page memory attributes")

This fixes is wrong. It won't matter in the long run, but it makes my life that
much harder.

Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
---
virt/kvm/kvm_main.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8bfeb615fc4d..49380cd62367 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -535,6 +535,7 @@ struct kvm_mmu_notifier_range {
} arg;
gfn_handler_t handler;
on_lock_fn_t on_lock;
+ on_unlock_fn_t before_unlock;
on_unlock_fn_t on_unlock;

Ugh, shame on my past me. Having on_lock and on_unlock be asymmetrical with respect
to the lock is nasty.

I would much rather we either (a) be explicit, e.g. before_(un)lock and after_(un)lock,
or (b) have just on_(un)lock, make them symetrical, and handle the SEV mess a
different way.

The SEV hook doesn't actually care about running immediately after unlock, it just
wants to know if there was an overlapping memslot. It can run after SRCU is dropped,
because even if we make the behavior more precise (right now it blasts WBINVD),
just having a reference to memslots isn't sufficient, the code needs to guarantee
memslots are *stable*. And that is already guaranteed by the notifier code, i.e.
the SEV code could just reacquire SRCU.

On a separate note here, the SEV hook blasting WBINVD is still causing
serious performance degradation issues with SNP triggered via
AutoNUMA/numad/KSM, etc. With reference to previous discussions related to
it, we have plans to replace WBINVD with CLFLUSHOPT.

Isn't the flush unnecessary when freeing shared memory? My recollection is that
the problematic scenario is when encrypted memory is freed back to the host,
because KVM already flushes when potentially encrypted mapping memory into the
guest.

With SNP+guest_memfd, private/encrypted memory should be unreachabled via the
hva-based mmu_notifiers. gmem should have full control of the page lifecycles,
i.e. can get the kernel virtual address as appropriated, and so it SNP shouldn't
need the nuclear option.

E.g. something like this?

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 07756b7348ae..1c6828ae391d 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2328,7 +2328,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)

void sev_guest_memory_reclaimed(struct kvm *kvm)
{
- if (!sev_guest(kvm))
+ if (!sev_guest(kvm) || sev_snp_guest(kvm))
return;

wbinvd_on_all_cpus();

I hope this is the final solution :)

So, short answer: no.

SNP+guest_memfd prevent untrusted host user space from directly
modifying the data, this is good enough for CVE-2022-0171, but there
is no such guarantee that the host kernel in some scenarios could
access the data and generate dirty caches. In fact, AFAIC, SNP VM does
not track whether each page is previously shared, isn't it? If a page
was previously shared and was written by the host kernel or devices
before it was changed to private. No one tracks it and dirty caches
are there!

So, to avoid any corner case situations like the above, it seems
currently we have to retain the property: flushing the cache when the
guest memory mapping leaves KVM NPT.

Of course, this is fundamentally because SME_COHERENT only applies to
CPU cores, but not DMA. If SME_COHERENT is complete, flushing is no
longer needed. Alternatively, we need extra bookkeeping for KVM to
know whether each page has dirty cache lines. Another alternative is
to filter mmu_notifier reasons, which is the part that I am planning
to take. thoughts?


Now running SNP+guest_memfd with discard=both option enabled:

# bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
Attaching 1 probe...
^C

@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_release+60
__mmu_notifier_release+128
exit_mmap+657
__mmput+72
mmput+49
do_exit+752
do_group_exit+57
get_signal+2486
arch_do_signal_or_restart+51
exit_to_user_mode_prepare+257
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
exit_to_user_mode_prepare+337
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 2
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
xfer_to_guest_mode_handle_work+228
kvm_arch_vcpu_ioctl_run+1572
kvm_vcpu_ioctl+671
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 2
@[
sev_guest_memory_reclaimed+5
kvm_set_memslot+740
__kvm_set_memory_region.part.0+411
kvm_set_memory_region+89
kvm_vm_ioctl+1482
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 104
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
zap_page_range_single+384
unmap_mapping_range+279
shmem_fallocate+932
vfs_fallocate+345
__x64_sys_fallocate+71
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 5465
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
zap_page_range_single+384
madvise_vma_behavior+1967
madvise_walk_vmas+190
do_madvise.part.0+264
__x64_sys_madvise+98
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 69677

The maximum hits are seen with shmem_fallocate and madvise, which we believe are response to shared<->private
GHCB page-state-chage requests. discard=both handles discard both for
private and shared memory, so freeing shared memory
via fallocate(shared_memfd, FALLOC_FL_PUNCH_HOLE, ...) would trigger the
notifiers when freeing shared pages after guest converts a GPA to
private.

Now, as with SNP+guest_memfd, guest private memory is not mapped in host anymore, so i added a generic fix (instead of Sean's proposed patch of checking for SNP guest inside sev_guest_memory_reclaimed()):

--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -593,6 +593,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
unsigned long hva_start, hva_end;

slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+ if (kvm_slot_can_be_private(slot)) {
+ continue;
+ }
hva_start = max(range->start, slot->userspace_addr);
hva_end = min(range->end, slot->userspace_addr +
(slot->npages << PAGE_SHIFT));

With this fix added, the traces are as follows:

# bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
Attaching 1 probe...
^C

@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+812
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
exit_to_user_mode_prepare+337
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_release+60
__mmu_notifier_release+128
exit_mmap+657
__mmput+72
mmput+49
do_exit+752
do_group_exit+57
get_signal+2486
arch_do_signal_or_restart+51
exit_to_user_mode_prepare+257
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+812
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
xfer_to_guest_mode_handle_work+228
kvm_arch_vcpu_ioctl_run+1572
kvm_vcpu_ioctl+671
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]:
@[
sev_guest_memory_reclaimed+5
kvm_set_memslot+740
__kvm_set_memory_region.part.0+411
kvm_set_memory_region+89
kvm_vm_ioctl+1482
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 104
#

As expected, the SEV hook is not invoked for the guest private memory pages (no more invalidation from shmem_fallocate() + madvise()).

Isn't it better to skip invoking the KVM MMU invalidation notifier when the invalidated range belongs to guest private memory ?

> In fact, AFAIC, SNP VM does
> not track whether each page is previously shared, isn't it? If a page
> was previously shared and was written by the host kernel or devices
> before it was changed to private. No one tracks it and dirty caches
> are there!

The skipped invalidation here covered the case Mingwei mentioned above, where the pages are changed from private->shared and subsequent freeing of shared pages triggered the invalidation.

But, then why are we concerned about this, i thought we have concerns about the case where the dirty cache lines contain encrypted guest data ?

Thanks,
Ashish