Re: [PATCH v17 092/116] KVM: TDX: Handle TDX PV HLT hypercall

From: Isaku Yamahata
Date: Tue Jan 09 2024 - 12:36:40 EST


On Tue, Jan 09, 2024 at 08:21:13AM -0800,
Sean Christopherson <seanjc@xxxxxxxxxx> wrote:

> On Mon, Jan 08, 2024, Chao Gao wrote:
> > On Fri, Jan 05, 2024 at 03:05:12PM -0800, Sean Christopherson wrote:
> > >On Tue, Nov 07, 2023, isaku.yamahata@xxxxxxxxx wrote:
> > >> From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > >>
> > >> Wire up TDX PV HLT hypercall to the KVM backend function.
> > >>
> > >> Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > >> ---
> > >> arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++++++-
> > >> arch/x86/kvm/vmx/tdx.h | 3 +++
> > >> 2 files changed, 44 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > >> index 3a1fe74b95c3..4e48989d364f 100644
> > >> --- a/arch/x86/kvm/vmx/tdx.c
> > >> +++ b/arch/x86/kvm/vmx/tdx.c
> > >> @@ -662,7 +662,32 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > >>
> > >> bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
> > >> {
> > >> - return pi_has_pending_interrupt(vcpu);
> > >> + bool ret = pi_has_pending_interrupt(vcpu);
> > >> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > >> +
> > >> + if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
> > >> + return true;
> > >> +
> > >> + if (tdx->interrupt_disabled_hlt)
> > >> + return false;
> > >> +
> > >> + /*
> > >> + * This is for the case where the virtual interrupt is recognized,
> > >> + * i.e. set in vmcs.RVI, between the STI and "HLT". KVM doesn't have
> > >> + * access to RVI and the interrupt is no longer in the PID (because it
> > >> + * was "recognized"). It doesn't get delivered in the guest because the
> > >> + * TDCALL completes before interrupts are enabled.
> > >> + *
> > >> + * The TDX module sets RVI while in an STI interrupt shadow.
> > >> + * - TDExit (typically TDG.VP.VMCALL<HLT>) from the guest to the TDX module.
> > >> + * The interrupt shadow at this point is gone.
> > >> + * - It knows that there is an interrupt that can be delivered
> > >> + * (RVI > PPR && EFLAGS.IF=1, the other conditions of 29.2.2 don't
> > >> + * matter)
> > >> + * - It forwards the TDExit nevertheless, to a clueless hypervisor that
> > >> + * has no way to glean either RVI or PPR.
> > >
> > >WTF. Seriously, what in the absolute hell is going on. I reported this internally
> > >four ***YEARS*** ago. This is not some obscure theoretical edge case, this is core
> > >functionality and it's completely broken garbage.
> > >
> > >NAK. Hard NAK. Fix the TDX module, full stop.
> > >
> > >Even worse, TDX 1.5 apparently _already_ has the necessary logic for dealing with
> > >interrupts that are pending in RVI when handling NESTED VM-Enter. Really!?!?!
> > >Y'all went and added nested virtualization support of some kind, but can't find
> > >the time to get the basics right?
> >
> > We actually fixed the TDX module. See 11.9.5, "Pending Virtual Interrupt
> > Delivery Indication", in the TDX module 1.5 spec [1]:
> >
> > The host VMM can detect whether a virtual interrupt is pending delivery to a
> > VCPU in the Virtual APIC page, using TDH.VP.RD to read the VCPU_STATE_DETAILS
> > TDVPS field.
> >
> > The typical use case is when the guest TD VCPU indicates to the host VMM, using
> > TDG.VP.VMCALL, that it has no work to do and can be halted. The guest TD is
> > expected to pass an “interrupt blocked” flag. The guest TD is expected to set
> > this flag to 0 if and only if RFLAGS.IF is 1 or the TDCALL instruction that
> > invokes TDG.VP.VMCALL immediately follows an STI instruction. If the “interrupt
> > blocked” flag is 0, the host VMM can determine whether to re-schedule the guest
> > TD VCPU based on VCPU_STATE_DETAILS.
> >
> > Isaku, this patch didn't read VCPU_STATE_DETAILS. Maybe you missed something
> > during rebase? Regarding buggy_hlt_workaround, do you aim to avoid reading
> > VCPU_STATE_DETAILS as much as possible (because reading it via SEAMCALL is
> > costly, ~3-4K cycles)?
>
> *sigh* Why on earth doesn't the TDX module simply compute VMXIP on TDVMCALL?
> It's literally one bit and one extra VMREAD. There are plenty of register bits
> available, and I highly doubt ~20 cycles in the TDVMCALL path will be noticeable,
> let alone problematic. Such functionality could even be added on top in a TDX
> module update, and Intel could even bill it as a performance optimization.
>
> Eating 4k cycles in the HLT path isn't the end of the world, but it's far from
> optimal and it's just so incredibly wasteful. I wouldn't be surprised if the
> latency is measurable for certain workloads, which will lead to guests using
> idle=poll and/or other games being played in the guest.
>
> And AFAICT, the TDX module doesn't support HLT passthrough, so fully dedicated
> CPUs can't even mitigate the pain that way.
>
> Anyways, regarding the "workaround", my NAK stands. It has bad tradeoffs of its
> own, e.g. will result in spurious wakeups, and can't possibly work for VMs with
> passthrough devices. Not to mention that the implementation has several races
> and false positives.

I'll drop the workaround (the last part of this patch) and read
VCPU_STATE_DETAILS via TDH.VP.RD with TDX 1.5.
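
Roughly what I have in mind (completely untested sketch; the helper and
bit names below are placeholders, not the final identifiers; the only
thing taken from the spec is that VCPU_STATE_DETAILS reports whether a
virtual interrupt is pending delivery):

static bool tdx_has_pending_virtual_interrupt(struct kvm_vcpu *vcpu)
{
        struct vcpu_tdx *tdx = to_tdx(vcpu);
        u64 details;

        /*
         * TDH.VP.RD(VCPU_STATE_DETAILS) is costly (~3-4K cycles), so it
         * is only reached when the vCPU is halted with interrupts enabled.
         */
        details = tdx_vcpu_state_details_read(tdx);    /* placeholder helper */

        /* Virtual interrupt pending delivery (i.e. recognized in RVI). */
        return details & TDX_VCPU_STATE_DETAILS_INTR_PENDING;  /* placeholder */
}

bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
        if (pi_has_pending_interrupt(vcpu) ||
            vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
                return true;

        /* The guest passed "interrupt blocked" = 1; don't wake it up. */
        if (to_tdx(vcpu)->interrupt_disabled_hlt)
                return false;

        /*
         * Catch an interrupt that was already recognized into RVI between
         * STI and the TDCALL and is therefore no longer visible in the PID.
         */
        return tdx_has_pending_virtual_interrupt(vcpu);
}

The SEAMCALL is only issued when KVM is about to leave a halted vCPU
blocked, so the ~3-4K cycle cost stays confined to the HLT path.
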
--
Isaku Yamahata <isaku.yamahata@xxxxxxxxxxxxxxx>