Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP

From: Huang, Kai
Date: Wed Jul 05 2023 - 07:35:20 EST


On Wed, 2023-07-05 at 12:21 +0200, Peter Zijlstra wrote:
> On Mon, Jul 03, 2023 at 12:15:13PM +0000, Huang, Kai wrote:
> >
> > >
> > > So I think the below deals with everything and unifies __tdx_hypercall()
> > > and __tdx_module_call(), since both sides needs to deal with exactly the
> > > same trainwreck.
> >
> > Hi Peter,
> >
> > Just want to make sure I understand you correctly:
> >
> > You want to make __tdx_module_call() look like __tdx_hypercall(), but not to
> > unify them into one assembly (at least for now), right?
>
> Well, given the horrendous trainwreck this is all turning into, I
> through it prudent to have it all in a single place. The moment you go
> play games with callee-saved registers you're really close to what
> hypercall does so then they might as well be the same.

OK I understand you now. Thanks.

Yeah I think from long-term's view, since SEAMCALLs to support live migration
pretty much uses all RCX/RDX/R8-R15 as input/output, it seems reasonable to
unify all of them, although I guess there might be some special handling to
VP.VMCALL and/or VP.ENTER, e.g., below:

/* TDVMCALL leaf return code is in R10 */
movq %r10, %rax

So long-termly, I don't have objection to that. But my thinking is for the
first version of TDX host support, we don't have to support all SEAMCALLs but
only those involved in basic TDX support. Those SEAMCALLs (except VP.ENTER)
only uses RCX/RDX/R8/R9 as input and RCX/RDX/R8-R11 as output, so to me it looks
fine to only make __tdx_module_call() look like __tdx_hypercall() as the first
step.

Also, the new SEAMCALLs to handle live migration all seem to have below
statement:

AVX, AVX2 May be reset to the architectural INIT state
and
AVX512
state

Which means those SEAMCALLs need to preserve AVX* states too?

And reading the spec, the VP.VMCALL and VP.ENTER also can use XMM0 - XMM15 as
input/output. Linux VP.VMCALL seems doesn't support using XMM0 - XMM15 as
input/output, but KVM can run other guest OSes too so I think KVM VP.ENTER needs
to handle XMM0-XMM15 as input/output too.

That being said, I think although we can provide a common asm macro to cover
VP.ENTER, I suspect KVM still needs to do additional assembly around the macro
too. So I am not sure whether we should try to cover VP.ENTER.

And I don't want to speak for KVM maintainers. :)

Hi Sean/Paolo, do you have any comments here?

>
> > I am confused you mentioned VP.VMCALL below, which is handled by
> > __tdx_hypercall().
>
> But why? It really isn't *that* special if you consider the other calls
> that are using callee-saved regs, yes it has the rdi/rsi extra, but meh,
> it really just is tdcall-0.

As mentioned above I don't have objection to this :)

>
>
> > > *-------------------------------------------------------------------------
> > > * TDCALL/SEAMCALL ABI:
> > > *-------------------------------------------------------------------------
> > > * Input Registers:
> > > *
> > > * RAX - Leaf number.
> > > * RCX,RDX,R8-R11 - Leaf specific input registers.
> > > * RDI,RSI,RBX,R11-R15 - VP.VMCALL VP.ENTER
> > > *
> > > * Output Registers:
> > > *
> > > * RAX - instruction error code.
> > > * RCX,RDX,R8-R11 - Leaf specific output registers.
> > > * RDI,RSI,RBX,R12-R15 - VP.VMCALL VP.ENTER
> >
> > As mentioned above, VP.VMCALL is handled by __tdx_hypercall(). Also, VP.ENTER
> > will be handled by KVM's own assembly. They both are not handled in this
> > TDX_MODULE_CALL assembly.
>
> I don't think they should be special, they're really just yet another
> leaf call. Yes, they have a shit calling convention, and yes VP.ENTER is
> terminally broken for unconditionally clobbering BP :-(
>
> That really *must* be fixed.

Sure I don't have objection to this, and for VP.ENTER please see above.

But I'd like to say that, generally speaking, from virtualization's point of
view, guest has its own BP and conceptually the hypervisor needs to restore
guest's BP before jumping to the guest. E.g., for normal VMX guest, KVM always
restores guest's BP before VMENTER (arch/x86/kvm/vmx/vmenter.S):

SYM_FUNC_START(__vmx_vcpu_run)
push %_ASM_BP
mov %_ASM_SP, %_ASM_BP

...
mov VCPU_RBP(%_ASM_AX), %_ASM_BP
...
vmenter/vmresume
...
SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
.....
mov %_ASM_BP, VCPU_RBP(%_ASM_AX)
...
pop %_ASM_BP
RET

>
> > > .Lcall:
> > > .if \host
> > > seamcall
> > > /*
> > > * SEAMCALL instruction is essentially a VMExit from VMX root
> > > * mode to SEAM VMX root mode. VMfailInvalid (CF=1) indicates
> > > * that the targeted SEAM firmware is not loaded or disabled,
> > > * or P-SEAMLDR is busy with another SEAMCALL. RAX is not
> > > * changed in this case.
> > > */
> > > jc .Lseamfail
> > >
> > > .if \saved && \ret
> > > /*
> > > * VP.ENTER clears RSI on output, use it to restore state.
> > > */
> > > popq %rsi
> > > xor %edi,%edi
> > > movq %rdi, TDX_MODULE_rdi(%rsi)
> > > movq %rdi, TDX_MODULE_rsi(%rsi)
> > > .endif
> > > .else
> > > tdcall
> > >
> > > /*
> > > * RAX!=0 indicates a failure, assume no return values.
> > > */
> > > testq %rax, %rax
> > > jne .Lerror
> >
> > For some SEAMCALL/TDCALL the output registers may contain additional error
> > information. We need to jump to a location where whether returning those
> > additional regs to 'struct tdx_module_args' depends on \ret.
>
> I suppose we can move this into the below conditional :-( The [DS]I
> register stuff requires a scratch reg to recover, AX being zero provides
> that.

Yeah this can certainly be done in one way or another.

>
> > > .if \saved && \ret
> > > /*
> > > * Since RAX==0, it can be used as a scratch register to restore state.
> > > *
> > > * [ assumes \saved implies \ret ]
> > > */
> > > popq %rax
> > > movq %rdi, TDX_MODULE_rdi(%rax)
> > > movq %rsi, TDX_MODULE_rsi(%rax)
> > > movq %rax, %rsi
> > > xor %eax, %eax;
> > > .endif
> > > .endif // \host
>
> So the reason I want this, is that I feel very strongly that if you
> cannot write a single coherent wrapper for all this, its calling
> convention is fundamentally *too* complex / broken.

In general I agree, but I am not sure whether there's any detail holding us
back. :)