Re: [PATCH] KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated

From: Jeremi Piotrowski
Date: Fri Apr 14 2023 - 07:59:05 EST


On 4/14/2023 1:12 AM, Sean Christopherson wrote:
> Preserve TDP MMU roots until they are explicitly invalidated by gifting
> the TDP MMU itself a reference to a root when it is allocated. Keeping a
> reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
> performance, and can potentially even soft-hang a vCPU, if a vCPU
> frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
>
> When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
> and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU
> cache of previous roots). Unloading roots is a simple way to ensure KVM
> flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
> when allocating a "new" root (from the vCPU's perspective).
>
> In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
> a per-VM hash table. Unloading a shadow MMU root just wipes it from the
> per-vCPU cache; the root is still tracked in the per-VM hash table. When
> KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
> in the per-VM hash table.
>
> Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
> per-VM structure, where "active" in this case means a root is either
> in-use or cached as a previous root by at least one vCPU. When a TDP MMU
> root becomes inactive, i.e. the last vCPU reference to the root is put,
> KVM immediately frees the root (asterisk on "immediately" as the actual
> freeing may be done by a worker, but for all intents and purposes the root
> is gone).
>
> The TDP MMU behavior is especially problematic for 1-vCPU setups, as
> unloading all roots effectively frees all roots. The issue is mitigated
> to some degree in multi-vCPU setups as a different vCPU usually holds a
> reference to an unloaded root and thus keeps the root alive, allowing the
> vCPU to reuse its old root after unloading (with a flush+sync).
>
> The TDP MMU flaw has been known for some time, as until very recently,
> KVM's handling of CR0.WP also triggered unloading of all roots. The
> CR0.WP toggling scenario was eventually addressed by not unloading roots
> when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
> for emulating SMM as KVM must emulate a full TLB flush on entry and exit
> to/from SMM. Given that the shadow MMU plays nice with unloading roots
> at will, teaching the TDP MMU to do the same is far less complex than
> modifying KVM to track which roots need to be flushed before reuse.
>
> Note, preserving all possible TDP MMU roots is not a concern with respect
> to memory consumption. Now that the role for direct MMUs doesn't include
> information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
> are _at most_ six possible roots (where "guest_mode" here means L2):
>
> 1. 4-level !SMM !guest_mode
> 2. 4-level SMM !guest_mode
> 3. 5-level !SMM !guest_mode
> 4. 5-level SMM !guest_mode
> 5. 4-level !SMM guest_mode
> 6. 5-level !SMM guest_mode
>
> And because each vCPU can track 4 valid roots, a VM can already have all
> 6 root combinations live at any given time. Not to mention that, in
> practice, no sane VMM will advertise different guest.MAXPHYADDR values
> across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
> a single VM. Furthermore, the vast majority of modern hypervisors will
> utilize EPT/NPT when available, thus the guest_mode=%true cases are also
> unlikely to be utilized.
>
> Reported-by: Jeremi Piotrowski <jpiotrowski@xxxxxxxxxxxxxxxxxxx>
> Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@xxxxxxxxxxxxxxxxxxx
> Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
> Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@xxxxxxxxxxxxxx
> Cc: David Matlack <dmatlack@xxxxxxxxxx>
> Cc: Ben Gardon <bgardon@xxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 80 +++++++++++++-------------------------
> 1 file changed, 28 insertions(+), 52 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b2fca11b91ff..343deccab511 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -40,7 +40,17 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
>
> void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> {
> - /* Also waits for any queued work items. */
> + /*
> + * Invalidate all roots, which besides the obvious, schedules all roots
> + * for zapping and thus puts the TDP MMU's reference to each root, i.e.
> + * ultimately frees all roots.
> + */
> + kvm_tdp_mmu_invalidate_all_roots(kvm);
> +
> + /*
> + * Destroying a workqueue also first flushes the workqueue, i.e. no
> + * need to invoke kvm_tdp_mmu_zap_invalidated_roots().
> + */
> destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
>
> WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
> @@ -116,16 +126,6 @@ static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root
> queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
> }
>
> -static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
> -{
> - union kvm_mmu_page_role role = page->role;
> - role.invalid = true;
> -
> - /* No need to use cmpxchg, only the invalid bit can change. */
> - role.word = xchg(&page->role.word, role.word);
> - return role.invalid;
> -}
> -
> void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> bool shared)
> {
> @@ -134,45 +134,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
> return;
>
> - WARN_ON(!is_tdp_mmu_page(root));
> -
> /*
> - * The root now has refcount=0. It is valid, but readers already
> - * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
> - * rejects it. This remains true for the rest of the execution
> - * of this function, because readers visit valid roots only
> - * (except for tdp_mmu_zap_root_work(), which however
> - * does not acquire any reference itself).
> - *
> - * Even though there are flows that need to visit all roots for
> - * correctness, they all take mmu_lock for write, so they cannot yet
> - * run concurrently. The same is true after kvm_tdp_root_mark_invalid,
> - * since the root still has refcount=0.
> - *
> - * However, tdp_mmu_zap_root can yield, and writers do not expect to
> - * see refcount=0 (see for example kvm_tdp_mmu_invalidate_all_roots()).
> - * So the root temporarily gets an extra reference, going to refcount=1
> - * while staying invalid. Readers still cannot acquire any reference;
> - * but writers are now allowed to run if tdp_mmu_zap_root yields and
> - * they might take an extra reference if they themselves yield.
> - * Therefore, when the reference is given back by the worker,
> - * there is no guarantee that the refcount is still 1. If not, whoever
> - * puts the last reference will free the page, but they will not have to
> - * zap the root because a root cannot go from invalid to valid.
> + * The TDP MMU itself holds a reference to each root until the root is
> + * explicitly invalidated, i.e. the final reference should be never be
> + * put for a valid root.
> */
> - if (!kvm_tdp_root_mark_invalid(root)) {
> - refcount_set(&root->tdp_mmu_root_count, 1);
> -
> - /*
> - * Zapping the root in a worker is not just "nice to have";
> - * it is required because kvm_tdp_mmu_invalidate_all_roots()
> - * skips already-invalid roots. If kvm_tdp_mmu_put_root() did
> - * not add the root to the workqueue, kvm_tdp_mmu_zap_all_fast()
> - * might return with some roots not zapped yet.
> - */
> - tdp_mmu_schedule_zap_root(kvm, root);
> - return;
> - }
> + KVM_BUG_ON(!is_tdp_mmu_page(root) || !root->role.invalid, kvm);
>
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> list_del_rcu(&root->link);
> @@ -320,7 +287,14 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> root = tdp_mmu_alloc_sp(vcpu);
> tdp_mmu_init_sp(root, NULL, 0, role);
>
> - refcount_set(&root->tdp_mmu_root_count, 1);
> + /*
> + * TDP MMU roots are kept until they are explicitly invalidated, either
> + * by a memslot update or by the destruction of the VM. Initialize the
> + * refcount to two; one reference for the vCPU, and one reference for
> + * the TDP MMU itself, which is held until the root is invalidated and
> + * is ultimately put by tdp_mmu_zap_root_work().
> + */
> + refcount_set(&root->tdp_mmu_root_count, 2);
>
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
> @@ -964,10 +938,12 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> {
> struct kvm_mmu_page *root;
>
> - lockdep_assert_held_write(&kvm->mmu_lock);
> + /* No need to hold mmu_lock when the VM is being destroyed. */
> + if (refcount_read(&kvm->users_count))
> + lockdep_assert_held_write(&kvm->mmu_lock);
> +
> list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> - if (!root->role.invalid &&
> - !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
> + if (!root->role.invalid) {
> root->role.invalid = true;
> tdp_mmu_schedule_zap_root(kvm, root);
> }
>
> base-commit: 62cf1e941a1169a5e8016fd8683d4d888ab51e01

Thank you, I just tested this and it works wonderfully! Is this still on time for 6.3?
In case you need it:

Tested-by: Jeremi Piotrowski <jpiotrowski@xxxxxxxxxxxxxxxxxxx>

I'd also like to get this backported all the way back to 5.15 because the issue is
already present there. I tried it myself, but this was before async zap and i'm
doing something wrong with refcounts:

I tried:

kvm_mmu_uninit_tdp_mmu()
{
kvm_tdp_mmu_invalidate_all_roots(kvm);
kvm_tdp_mmu_zap_invalidated_roots(kvm);
}

and dropping the refcount_inc_not_zero() from kvm_tdp_mmu_invalidate_all_roots(),
but I hit:

[ 56.163528] refcount_t: underflow; use-after-free.
[ 56.163533] WARNING: CPU: 4 PID: 1137 at lib/refcount.c:28 refcount_warn_saturate+0x9f/0x150
[ 56.163539] Modules linked in: xt_owner xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tls cfg80211 binfmt_misc nls_iso8859_1 xt_tcpudp nft_compat nf_tables libcrc32c nfnetlink kvm_amd ccp kvm hyperv_drm drm_kms_helper crct10dif_pclmul joydev crc32_pclmul ghash_clmulni_intel hid_generic cec rc_core aesni_intel fb_sys_fops syscopyarea crypto_simd hid_hyperv sysfillrect serio_raw cryptd hyperv_keyboard sysimgblt hv_netvsc hid hyperv_fb mac_hid pata_acpi dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ramoops reed_solomon drm efi_pstore i2c_core ip_tables x_tables autofs4
[ 56.163568] CPU: 4 PID: 1137 Comm: qemu-system-x86 Not tainted 5.15.107-00003-g2ee068e4a996 #1
[ 56.163570] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 56.163572] RIP: 0010:refcount_warn_saturate+0x9f/0x150
[ 56.163574] Code: cc cc 0f b6 1d d3 4a 9a 01 80 fb 01 0f 87 3f d1 69 00 83 e3 01 75 e1 48 c7 c7 e8 59 bd a3 c6 05 b7 4a 9a 01 01 e8 df 2b 66 00 <0f> 0b eb ca 0f b6 1d aa 4a 9a 01 80 fb 01 0f 87 ff d0 69 00 83 e3
[ 56.163575] RSP: 0018:ffffbefb853f3c60 EFLAGS: 00010282
[ 56.163577] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
[ 56.163578] RDX: ffffa00dffb1f6c8 RSI: 0000000000000001 RDI: ffffa00dffb1f6c0
[ 56.163579] RBP: ffffbefb853f3c68 R08: 0000000000000000 R09: ffffbefb853f3bf0
[ 56.163579] R10: 00000000ffffe245 R11: 0000000000000246 R12: ffff9ffe43c37590
[ 56.163580] R13: ffffbefb854f5000 R14: 0000000000000001 R15: ffffffffffffffff
[ 56.163583] FS: 00007f389bcfc6c0(0000) GS:ffffa00dffb00000(0000) knlGS:0000000000000000
[ 56.163585] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 56.163585] CR2: 00007f378c030000 CR3: 0000000101c60001 CR4: 0000000000370ee0
[ 56.163586] Call Trace:
[ 56.163588] <TASK>
[ 56.163590] kvm_tdp_mmu_put_root+0x11b/0x140 [kvm]
[ 56.163622] mmu_free_root_page+0x9a/0xd0 [kvm]
[ 56.163646] kvm_mmu_free_roots+0x149/0x1e0 [kvm]
[ 56.163666] kvm_mmu_unload+0x20/0x70 [kvm]
[ 56.163686] kvm_arch_vcpu_ioctl_run+0x9da/0x1750 [kvm]
[ 56.163709] ? kvm_vm_ioctl+0x393/0x1120 [kvm]
[ 56.163729] ? kvm_vm_ioctl+0x393/0x1120 [kvm]
[ 56.163748] kvm_vcpu_ioctl+0x2d7/0x7b0 [kvm]
[ 56.163767] ? __fget_files+0x83/0xb0
[ 56.163770] ? __fget_files+0x83/0xb0
[ 56.163772] __x64_sys_ioctl+0x98/0xd0
[ 56.163774] do_syscall_64+0x5b/0x80
[ 56.163776] ? do_syscall_64+0x67/0x80
[ 56.163777] ? exc_page_fault+0x7c/0x150
[ 56.163779] entry_SYSCALL_64_after_hwframe+0x61/0xcb

which shouldn't be possible since we always hold 2 refs.

Jeremi