Re: [RESEND PATCH 5/6] KVM: x86/VMX: add kvm_vmx_reinject_nmi_irq() for NMI/IRQ reinjection

From: Sean Christopherson
Date: Thu Nov 17 2022 - 10:51:47 EST


On Thu, Nov 17, 2022, Li, Xin3 wrote:
>
> > > > > > But what about NMIs, afaict this is all horribly broken for NMIs.
> > > > > > So the whole VMX thing latches the NMI (which stops NMI
> > > > > > recursion), right?
> > > > > >
> > > > > > But then you drop out of noinstr code, which means any random
> > > > > > exception can happen (kprobes #BP, hw_breakpoint #DB, or even
> > > > > > #PF due to random nonsense like *SAN). This exception will do
> > > > > > IRET and clear the NMI latch, all before you get to run any of the
> > > > > > NMI code.
> > > > >
> > > > > What you said here implies that we have this problem in the existing code.
> > > > > Because a fake iret stack is created to call the NMI handler in
> > > > > the IDT NMI descriptor, which lastly executes the IRET instruction.
> > > >
> > > > I can't follow; of course the IDT handler terminates with IRET, it has to no?
> > > >
> > > > And yes, the current code appears to suffer the same defect.

That defect isn't going to be fixed simply by changing how KVM forwards NMIs
though. IIUC, _everything_ between VM-Exit and the invocation of the NMI handler
needs to be noinstr. On VM-Exit due to NMI, NMIs are blocked. If a #BP/#DB/#PF
occurs before KVM gets to kvm_x86_handle_exit_irqoff(), the subsequent IRET will
unblock NMIs before the original NMI is serviced, i.e. a second NMI could come in
at anytime regardless of how KVM forwards the NMI to the kernel.

Is there any way to solve this without tagging everything noinstr? There is a
metric shit ton of code between VM-Exit and the handling of NMIs, and much of that
code is common helpers. It might be possible to hoist NMI handler much earlier,
though we'd need to do a super thorough audit to ensure all necessary host state
is restored.

> > > With FRED, ERETS/ERETU replace IRET, and use bit 28 of the popped CS
> > > field to control whether to unblock NMI. If bit 28 of the field (above
> > > the selector) is 1, ERETS/ERETU unblocks NMIs.

Side topic, there's a bug in the ISE docs. Section "9.4.2 NMI Blocking" states
that bit 16 holds the "unblock NMI" magic, which I'm guessing is a holdover from
an earlier revision of FRED.

As specified in Section 6.1.3 and Section 6.2.3, ERETS and ERETU each unblocks NMIs
if bit 16 of the popped CS field is 1. The following items detail how this behavior may be
changed in VMX non-root operation, depending on the settings of certain VM-execution
controls:

> > Yes, I know that. It is one of the many improvements FRED brings.
> > Ideally the IBT WAIT-FOR-ENDBR state also gets squirreled away in the
> > hardware exception frame, but that's still up in the air I believe :/
> >
> > Anyway.. given there is interrupt priority and NMI is pretty much on top of
> > everything else the reinject crap *should* run NMI first. That way NMI runs
> > with the latch disabled and whatever other pending interrupts will run later.
> >
> > But that all is still broken because afaict the current code also leaves noinstr --
> > and once you leave noinstr (or use a static_key, static_call or anything else that
> > *can* change at runtime) you can't guarantee nothing.
>
> For NMI, HPA asked me to use "int $2", as it switches to the NMI IST stack to
> execute the NMI handler, essentially like how HW deals with a NMI in host. and
> I tested it with NMI watchdog, it looks working fine.

Heh, well yeah, because that's how KVM used to handle NMIs back before I reworked
NMI handling to use the direct call method. Ironically, that original change was
done in part to try and make it _easier_ to deal with FRED (back before FRED was
publicly disclosed).

If KVM reverts to INTn, the fix to route KVM=>NMI through the non-IST entry can
be reverted too.

a217a6593cec ("KVM/VMX: Invoke NMI non-IST entry instead of IST entry")
1a5488ef0dcf ("KVM: VMX: Invoke NMI handler via indirect call instead of INTn")