Re: [PATCH] KVM: SVM: Disable TDP MMU when running on Hyper-V

From: Sean Christopherson
Date: Thu Apr 13 2023 - 13:24:26 EST

Next message: syzbot: "[syzbot] [block?] WARNING in fd_locked_ioctl"
Previous message: Ackerley Tng: "Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory"
In reply to: Jeremi Piotrowski: "Re: [PATCH] KVM: SVM: Disable TDP MMU when running on Hyper-V"
Next in thread: Sean Christopherson: "Re: [PATCH] KVM: SVM: Disable TDP MMU when running on Hyper-V"

On Thu, Apr 13, 2023, Jeremi Piotrowski wrote:
> On 4/11/2023 6:02 PM, Sean Christopherson wrote:
> > By default, yes. I double checked that L2 has similar boot times for KVM-on-KVM
> > with and without the TDP MMU. Certainly nothing remotely close to 2 minutes.
>
> Something I just noticed by tracing hv_track_root_tdp is that the VM appears to go through
> some ~10000 unique roots in the period before kernel init starts (so grub + kernel decompression).
> That part seems to take a long time. Is this kind of churn of roots by design?
>
> The ftrace output for when the root changes looks something like this, kvm goes through smm emulation
> during the exit.
>
> qemu-system-x86-18971 [015] d.... 95922.997039: kvm_exit: vcpu 0 reason EXCEPTION_NMI rip 0xfd0bd info1 0x0000000000000000 info2 0x0000000000000413 intr_info 0x80000306 error_code 0x00000000
> qemu-system-x86-18971 [015] ..... 95922.997052: p_hv_track_root_tdp_0: (hv_track_root_tdp+0x0/0x70 [kvm]) si=0x18b082000
> qemu-system-x86-18971 [015] d.... 95922.997133: kvm_entry: vcpu 0, rip 0xf7d6b
>
> There are also root changes after IO_INSTRUCTION exits. When I look at non-tdp-mmu it seems to cycle between two
> roots in that phase time, and tdp-mmu allocates new ones instead.

#$&*#$*& SMM. I know _exactly_ what's going on.

When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI and RSM,
KVM unloads all of the vCPU's roots (KVM keeps a small per-vCPU cache of previous
roots). Unloading roots is a simple way to ensure KVM flushes and synchronizes
all roots for the vCPU, as KVM flushes and syncs when allocating a "new" root
(from the vCPU's perspective).

In the shadow MMU, KVM keeps track of all shadow pages, roots included, in a per-VM
hash table. Unloading a "shadow" root just wipes it from the per-vCPU cache; the
root is still tracked in the per-VM hash table. When KVM loads a "new" root for the
vCPU, KVM will find the old, unloaded root in the per-VM hash table.
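
To make the shadow MMU behavior concrete, here is a minimal userspace toy model,
not KVM code. Every name in it (toy_vm, toy_get_root, the fixed-size pool, and
keying the table by gfn alone) is an assumption made for illustration. It only
shows that unloading clears the per-vCPU pointer while the per-VM hash table
still finds the same root on the next load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: hypothetical names, not KVM's real data structures. */
#define TOY_HASH_BUCKETS 16
#define TOY_POOL_SIZE    32

struct toy_shadow_page {
	uint64_t gfn;                   /* lookup key (assumed: gfn only) */
	struct toy_shadow_page *next;   /* hash bucket chain */
};

struct toy_vm {
	struct toy_shadow_page *hash[TOY_HASH_BUCKETS]; /* per-VM tracking */
	struct toy_shadow_page pool[TOY_POOL_SIZE];     /* stand-in allocator */
	int allocations;                /* roots actually built */
};

struct toy_vcpu {
	struct toy_shadow_page *root;   /* NULL after an unload */
};

/* Find an existing root in the per-VM table, or allocate a new one. */
static struct toy_shadow_page *toy_get_root(struct toy_vm *vm, uint64_t gfn)
{
	unsigned int bucket = gfn % TOY_HASH_BUCKETS;
	struct toy_shadow_page *sp;

	for (sp = vm->hash[bucket]; sp; sp = sp->next)
		if (sp->gfn == gfn)
			return sp;      /* old, unloaded root is reused */

	assert(vm->allocations < TOY_POOL_SIZE);
	sp = &vm->pool[vm->allocations++];
	sp->gfn = gfn;
	sp->next = vm->hash[bucket];
	vm->hash[bucket] = sp;
	return sp;
}

static void toy_load_root(struct toy_vm *vm, struct toy_vcpu *vcpu, uint64_t gfn)
{
	vcpu->root = toy_get_root(vm, gfn);
}

/* Unloading only wipes the per-vCPU pointer; the per-VM table is untouched. */
static void toy_unload_root(struct toy_vcpu *vcpu)
{
	vcpu->root = NULL;
}
```

In this model, any number of unload/load cycles for the same gfn leaves
allocations at 1, which is the "cycles between two roots" behavior seen in the
non-TDP-MMU trace (one root per address space in use).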

But unloading roots is anathema for the TDP MMU. Unlike the shadow MMU, the TDP MMU
doesn't track _inactive_ roots in a per-VM structure, where "active" in this case
means a root is either in-use or cached as a previous root by at least one vCPU.
When a TDP MMU root becomes inactive, i.e. the last vCPU reference to the root is
put, KVM immediately frees the root (asterisk on "immediately" as the actual freeing
may be done by a worker, but for all intents and purposes the root is gone).

The TDP MMU behavior is especially problematic for 1-vCPU setups, as unloading all
roots effectively frees all roots, whereas in a multi-vCPU setup a different vCPU
usually holds a reference to an unloaded root and thus keeps the root alive, allowing
the vCPU to reuse its old root after unloading (with a flush+sync).
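
The difference between the 1-vCPU and multi-vCPU cases can be sketched with a
toy refcount model. This is a userspace illustration under stated assumptions,
not the TDP MMU implementation: the names (toy_root, toy_count_builds) are
hypothetical, and "freeing" is reduced to a flag plus a rebuild counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not KVM's real data structures. */
#define TOY_MAX_VCPUS 8

struct toy_root {
	int refcount;   /* vCPUs that use or cache this root */
	bool freed;     /* set when the last reference is put */
	int builds;     /* times the root was built from scratch */
};

struct toy_vcpu {
	struct toy_root *root;  /* NULL after an unload */
};

/* Load a root; if nobody holds it, it is gone and must be rebuilt. */
static void toy_load_root(struct toy_vcpu *vcpu, struct toy_root *root)
{
	if (root->refcount == 0) {
		root->freed = false;
		root->builds++;
	}
	root->refcount++;
	vcpu->root = root;
}

/* Unload (e.g. on SMI or RSM); putting the last reference frees the root. */
static void toy_unload_root(struct toy_vcpu *vcpu)
{
	struct toy_root *root = vcpu->root;

	if (!root)
		return;
	assert(root->refcount > 0);
	if (--root->refcount == 0)
		root->freed = true;
	vcpu->root = NULL;
}

/*
 * All vCPUs share one root; vCPU 0 then unloads and reloads it
 * 'transitions' times. Returns how often the root had to be built.
 */
static int toy_count_builds(int nr_vcpus, int transitions)
{
	struct toy_root root = {0};
	struct toy_vcpu vcpus[TOY_MAX_VCPUS] = {0};
	int i;

	assert(nr_vcpus >= 1 && nr_vcpus <= TOY_MAX_VCPUS);
	for (i = 0; i < nr_vcpus; i++)
		toy_load_root(&vcpus[i], &root);

	for (i = 0; i < transitions; i++) {
		toy_unload_root(&vcpus[0]);
		toy_load_root(&vcpus[0], &root);
	}
	return root.builds;
}
```

With one vCPU, every unload drops the refcount to zero, so the build count grows
with the number of transitions. With two or more vCPUs, the other reference
keeps the root alive and it is built only once.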

What's happening in your case is that legacy BIOS does some truly evil crud with
SMM, and can transition to/from SMM thousands of times during boot. On *every*
transition, KVM unloads its roots, i.e. KVM has to tear down, reallocate, and rebuild
a new root every time the vCPU enters SMM, and every time the vCPU exits SMM.

This exact problem was reported by the grsecurity folks when the guest toggles CR0.WP.
We duct taped a solution together for CR0.WP[1], and now finally have a more complete
fix lined up for 6.4[2], but the underlying flaw of the TDP MMU not preserving inactive
roots still exists.

Aha! Idea. There are _at most_ 4 possible roots the TDP MMU can encounter.
4-level non-SMM, 4-level SMM, 5-level non-SMM, and 5-level SMM. I.e. not keeping
inactive roots on a per-VM basis is just monumentally stupid. Ugh, and that's not
even the worst of our stupidity. The truly awful side of all this is that we
spent an absurd amount of time getting kvm_tdp_mmu_put_root() to play nice with
putting the last reference to a valid root while holding mmu_lock for read.
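
As a rough illustration of why four slots are enough, the sketch below keeps one
per-VM slot for each (paging level, SMM) combination listed above. It is a toy
userspace model, not the patch mentioned below and not KVM code; toy_vm,
toy_root_index and toy_get_root are hypothetical names, and invalidation of
roots is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not KVM's real data structures. */
enum { TOY_NR_ROOTS = 4 };  /* {4-level, 5-level} x {non-SMM, SMM} */

struct toy_root {
	bool valid;     /* slot has been built and can be reused */
	int level;      /* 4 or 5 */
	bool smm;       /* root belongs to the SMM address space */
};

struct toy_vm {
	struct toy_root roots[TOY_NR_ROOTS];    /* per-VM, survives unloads */
	int builds;                             /* roots built from scratch */
};

/* Map (level, smm) to a slot: 0..3. */
static int toy_root_index(int level, bool smm)
{
	assert(level == 4 || level == 5);
	return (level - 4) * 2 + (smm ? 1 : 0);
}

/* Reuse the per-VM root for this combination, building it on first use. */
static struct toy_root *toy_get_root(struct toy_vm *vm, int level, bool smm)
{
	struct toy_root *root = &vm->roots[toy_root_index(level, smm)];

	if (!root->valid) {
		root->valid = true;
		root->level = level;
		root->smm = smm;
		vm->builds++;
	}
	return root;
}
```

In this model a guest that bounces between SMM and non-SMM thousands of times
with 4-level paging builds exactly two roots, regardless of how many vCPUs
exist or how often they unload.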

Give me a few hours to whip together and test a patch, I think I see a way to fix
this without a massive amount of churn, and with fairly simple rules for how things
work.

[1] https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
[2] https://lore.kernel.org/all/20230322013731.102955-1-minipli@xxxxxxxxxxxxxx
