Re: [PATCH v1] KVM: nVMX: Fix handling triple fault on RSM instruction

From: mlevitsk
Date: Tue Jan 23 2024 - 09:49:11 EST


On Tue, 2024-01-02 at 11:57 -0800, Sean Christopherson wrote:
> > > > +Maxim
> > > >
> > > > On Fri, Dec 22, 2023, Michal Wilczynski wrote:
> > > > > > > > Syzkaller found a warning triggered in
> > > > > > > > nested_vmx_vmexit().
> > > > > > > > vmx->nested.nested_run_pending is non-zero, even though
> > > > > > > > we're in
> > > > > > > > nested_vmx_vmexit(). Generally, trying to cancel a
> > > > > > > > pending entry is
> > > > > > > > considered a bug. However in this particular scenario,
> > > > > > > > the kernel
> > > > > > > > behavior seems correct.
> > > > > > > >
> > > > > > > > Syzkaller scenario:
> > > > > > > > 1) Set up VCPU's
> > > > > > > > 2) Run some code with KVM_RUN in L2 as a nested guest
> > > > > > > > 3) Return from KVM_RUN
> > > > > > > > 4) Inject KVM_SMI into the VCPU
> > > > > > > > 5) Change the EFER register with KVM_SET_SREGS to value
> > > > > > > > 0x2501
> > > > > > > > 6) Run some code on the VCPU using KVM_RUN
> > > > > > > > 7) Observe following behavior:
> > > > > > > >
> > > > > > > > kvm_smm_transition: vcpu 0: entering SMM, smbase
> > > > > > > > 0x30000
> > > > > > > > kvm_entry: vcpu 0, rip 0x8000
> > > > > > > > kvm_entry: vcpu 0, rip 0x8000
> > > > > > > > kvm_entry: vcpu 0, rip 0x8002
> > > > > > > > kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x30000
> > > > > > > > kvm_nested_vmenter: rip: 0x0000000000008002 vmcs:
> > > > > > > > 0x0000000000007000
> > > > > > > >                     nested_rip: 0x0000000000000000
> > > > > > > > int_ctl: 0x00000000
> > > > > > > > event_inj: 0x00000000 nested_ept=n
> > > > > > > > guest
> > > > > > > > cr3: 0x0000000000002000
> > > > > > > > kvm_nested_vmexit_inject: reason: TRIPLE_FAULT
> > > > > > > > ext_inf1: 0x0000000000000000
> > > > > > > >                           ext_inf2: 0x0000000000000000
> > > > > > > > ext_int: 0x00000000
> > > > > > > > ext_int_err: 0x00000000
> > > > > > > >
> > > > > > > > What happened here is an SMI was injected immediately
> > > > > > > > and the handler was
> > > > > > > > called at address 0x8000; all is good. Later, an RSM
> > > > > > > > instruction is
> > > > > > > > executed in an emulator to return to the nested VM.
> > > > > > > > em_rsm() is called,
> > > > > > > > which leads to emulator_leave_smm(). A part of this
> > > > > > > > function calls VMX/SVM
> > > > > > > > callback, in this case vmx_leave_smm(). It attempts to
> > > > > > > > set up a pending
> > > > > > > > reentry to guest VM by calling
> > > > > > > > nested_vmx_enter_non_root_mode() and sets
> > > > > > > > vmx->nested.nested_run_pending to one. Unfortunately,
> > > > > > > > later in
> > > > > > > > emulator_leave_smm(), rsm_load_state_64() fails to
> > > > > > > > write invalid EFER to
> > > > > > > > the MSR. This results in em_rsm() calling triple_fault
> > > > > > > > callback. At this
> > > > > > > > point it's clear that the KVM should call the vmexit,
> > > > > > > > but
> > > > > > > > vmx->nested.nested_run_pending is left set to 1. To fix
> > > > > > > > this reset the
> > > > > > > > vmx->nested.nested_run_pending flag in triple_fault
> > > > > > > > handler.
> > > > > > > >
> > > > > > > > TL;DR (courtesy of Yuan Yao)
> > > > > > > > Clear nested_run_pending in case of RSM failure on
> > > > > > > > return from L2 SMM.
> > > >
> > > > KVM doesn't emulate SMM for L2. On an injected SMI, KVM forces
> > > > a syntethic nested
> > > > VM-Exit to get from L2 to L1, and then emulates SMM in the
> > > > context of L1.
> > > >
> > > > > > > > The pending VMENTRY to L2 should be cancelled due to
> > > > > > > > such failure leads
> > > > > > > > to triple fault which should be injected to L1.
> > > > > > > >
> > > > > > > > Possible alternative approach:
> > > > > > > > While the proposed approach works, the concern is that
> > > > > > > > it would be
> > > > > > > > simpler, and more readable to cancel the
> > > > > > > > nested_run_pending in em_rsm().
> > > > > > > > This would, however, require introducing new callback
> > > > > > > > e.g,
> > > > > > > > post_leave_smm(), that would cancel nested_run_pending
> > > > > > > > in case of a
> > > > > > > > failure to resume from SMM.
> > > > > > > >
> > > > > > > > Additionally, while the proposed code fixes VMX
> > > > > > > > specific issue, SVM also
> > > > > > > > might suffer from similar problem as it also uses it's
> > > > > > > > own
> > > > > > > > nested_run_pending variable.
> > > > > > > >
> > > > > > > > Reported-by: Zheyu Ma <zheyuma97@xxxxxxxxx>
> > > > > > > > Closes:
> > > > > > > > https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@xxxxxxxxxxxxxx
> > > >
> > > > Fixes: 759cbd59674a ("KVM: x86: nSVM/nVMX: set
> > > > nested_run_pending on VM entry which is a result of RSM")
> > > >
> > > > > > > > Signed-off-by: Michal Wilczynski
> > > > > > > > <michal.wilczynski@xxxxxxxxx>
> > > > > > > > ---
> > > > > > > >  arch/x86/kvm/vmx/nested.c | 9 +++++++++
> > > > > > > >  1 file changed, 9 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/arch/x86/kvm/vmx/nested.c
> > > > > > > > b/arch/x86/kvm/vmx/nested.c
> > > > > > > > index c5ec0ef51ff7..44432e19eea6 100644
> > > > > > > > --- a/arch/x86/kvm/vmx/nested.c
> > > > > > > > +++ b/arch/x86/kvm/vmx/nested.c
> > > > > > > > @@ -4904,7 +4904,16 @@ void nested_vmx_vmexit(struct
> > > > > > > > kvm_vcpu *vcpu, u32 vm_exit_reason,
> > > > > > > >  
> > > > > > > >  static void nested_vmx_triple_fault(struct kvm_vcpu
> > > > > > > > *vcpu)
> > > > > > > >  {
> > > > > > > > + struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > > > > > > +
> > > > > > > >   kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> > > > > > > > +
> > > > > > > > + /* In case of a triple fault, cancel the
> > > > > > > > nested reentry. This may occur
> > > >
> > > > /*
> > > > * Multi-line comments should look like this. Blah
> > > > blah blab blah blah
> > > > * blah blah blah blah.
> > > > */
> > > >
> > > > > > > > + * when the RSM instruction fails while
> > > > > > > > attempting to restore the state
> > > > > > > > + * from SMRAM.
> > > > > > > > + */
> > > > > > > > + vmx->nested.nested_run_pending = 0;
> > > >
> > > > Argh. KVM's handling of SMIs while L2 is active is complete
> > > > garbage. As explained
> > > > by the comment in vmx_enter_smm(), the L2<->SMM transitions
> > > > should have a completely
> > > > custom flow and not piggyback/usurp nested VM-Exit/VM-Entry.
> > > >
> > > > /*
> > > > * TODO: Implement custom flows for forcing the vCPU
> > > > out/in of L2 on
> > > > * SMI and RSM. Using the common VM-Exit + VM-Enter
> > > > routines is wrong
> > > > * SMI and RSM only modify state that is saved and
> > > > restored via SMRAM.
> > > > * E.g. most MSRs are left untouched, but many are
> > > > modified by VM-Exit
> > > > * and VM-Enter, and thus L2's values may be corrupted
> > > > on SMI+RSM.
> > > > */
> > > >
> > > > The Fixes: commit above added a hack on top of the hack. But
> > > > it's not entirely
> > > > clear from the changelog exactly what was being fixed.
> > > >
> > > >     While RSM induced VM entries are not full VM entries,
> > > >     they still need to be followed by actual VM entry to
> > > > complete it,
> > > >     unlike setting the nested state.
> > > >     
> > > >     This patch fixes boot of hyperv and SMM enabled
> > > >     windows VM running nested on KVM, which fail due
> > > >     to this issue combined with lack of dirty bit setting.
> > > >
> > > > My first guess would be event injection, but that shouldn't be
> > > > relevant to RSM.
> > > > Architecturally, events (SMIs, NMIs, IRQs, etc.) are recognized
> > > > at instruction
> > > > boundaries, but except for SMIs (see below), KVM naturally
> > > > defers injection until
> > > > an instruction boundary by virtue of delivering events via the
> > > > VMCS/VMCB, i.e. by
> > > > waiting to deliver events until successfully re-entering the
> > > > guest.
> > > >
> > > > Nested VM-Enter is a special snowflake because KVM needs to
> > > > finish injecting events
> > > > from vmcs12 before injecting any synthetic events, i.e.
> > > > nested_run_pending ensures
> > > > that KVM wouldn't clobber/override an L2 event coming from L1.
> > > > In other words,
> > > > nested_run_pending is much more specific than just needing to
> > > > wait for an instruction
> > > > boundary.
> > > >
> > > > So while the "wait until the CPU is at an instruction boundary"
> > > > applies to RSM,
> > > > it's not immediately obvious to me why setting
> > > > nested_run_pending is necessary.

If nested_run_pending is not set, then nested entry can be aborted by
an event
(e.g another #SMI, or just an interrupt since after VM entry the
interrupts are usually enabled,
and interrupts are intercepted usually).

It is even possible for a nested migration to happen right after the
RSM is executed but before
actual VMENTER is executed.

> > > > And setting nested_run_pending *after* calling
> > > > nested_vmx_enter_non_root_mode()
> > > > is nasty. nested_vmx_enter_non_root_mode() and its helpers use
> > > > nested_run_pending
> > > > to determine whether or not to enforce various consistency
> > > > checks and other
> > > > behaviors. And a more minor issue is that stat.nested_run will
> > > > be incorrectly
> > > > incremented.
> > > >
> > > > As a stop gap, something like this patch is not awful, though I
> > > > would strongly
> > > > prefer to be more precise and not clear it on all triple
> > > > faults. We've had KVM
> > > > bugs where KVM prematurely synthesizes triple fault on an
> > > > actual nested VM-Enter,
> > > > and those would be covered up by this fix.
> > > >
> > > > But due to nested_run_pending being (unnecessarily) buried in
> > > > vendor structs, it
> > > > might actually be easier to do a cleaner fix. E.g. add yet
> > > > another flag to track
> > > > that a hardware VM-Enter needs to be completed in order to
> > > > complete instruction
> > > > emulation.
> > > >
> > > > And as alluded to above, there's another bug lurking. Events
> > > > that are *emulated*
> > > > by KVM must not be emulated until KVM knows the vCPU is at an
> > > > instruction boundary.
> > > > Specifically, enter_smm() shouldn't be invoked while KVM is in
> > > > the middle of
> > > > instruction emulation (even if "emulation" is just setting
> > > > registers and skipping
> > > > the instruction). Theoretically, that could be fixed by
> > > > honoring the existing
> > > > at_instruction_boundary flag for SMIs, but that'd be a rather
> > > > large change and
> > > > at_instruction_boundary is nowhere near accurate enough to use
> > > > right now.
> > > >
> > > > Anyways, before we do anything, I'd like to get Maxim's input
> > > > on what exactly was
> > > > addressed by 759cbd59674a.
> > > >

There is a good explanation in commit 759cbd59674a and
commit e8efa4ff0037 that comes before it.

In a nutshell, there were two problems that together created
conditions for the guest boot failure:

1. When a non intercepted #SMI arrives while L2 is running, it's
almost like a VM exit from L2 to L1, however instead of restoring
the host state, KVM jumps to the SMI handler in real mode in L1.
Thus to be able later to run a regular VM exit handler, KVM stores
the host state (which is present in vmcb01) to HSAVE area and on RSM
restores it from HSAVE to vmcb01.
The bug was that KVM wasn't marking vmcb01 as dirty when restoring
the host state.
Usually this is not a problem because in this case we resume L2 and
thus switch to vmcb02.

2. Due to lack of setting the nested_run_pending, the above switch to
vmcb02 might not complete if there are pending events, which is allowed
because RSM doesn't have interrupt shadow, thus once we finished
emulating RSM, we are in the L2 guest and likely with interrupts
enabled.

If the switch to vmcb02 doesn't complete we end up still running
with vmcb01 and with lots of fields that were changed but not
marked as dirty.

Now usually this works because the CPU isn't that agressive in caching
vmcb fields, but if the whole thing is running nested under KVM,
then kvm always caches few 'rare' vmcb fields, like the IDTR, so
we end up with stale IDTR and on next interrupt the guest explodes.


PS: this is one of the mails that I forgot to send because I moved to
Canada recently and was offline for some time due to that.

Best regards,
Maxim Levitsky