Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack

From: Andy Lutomirski
Date: Thu Feb 18 2021 - 19:29:50 EST


On Thu, Feb 18, 2021 at 11:21 AM Joerg Roedel <jroedel@xxxxxxx> wrote:
>
> On Thu, Feb 18, 2021 at 09:49:06AM -0800, Andy Lutomirski wrote:
> > I don't understand what this means. The whole entry mechanism on x86
> > is structured so that we call a C function *and return from that C
> > function without longjmp-like magic* with the sole exception of
> > rewind_stack_do_exit(). This means that you can match up enters and
> > exits, and that rewind_stack_do_exit() needs to unwind correctly. In
> > the former case, it's normal C and we can use normal local variables.
> > In the latter case, we know exactly what state we're trying to restore
> > and we can restore it directly without any linked lists or similar.
>
> Okay, the unwinder will likely get confused by this logic.
>
> > What do you have in mind that requires a linked list?
>
> Cases when there are multiple IST vectors besides NMI that can hit while
> the #VC handler is still on its own IST stack. #MCE comes to mind, but
> that is broken anyway. At some point #VC itself will be one of them, but
> when that happens the code will kill the machine.
> This leaves #HV in the list, and I am not sure how that is going to be
> handled yet. I think the goal is that the #HV handler is not allowed to
> cause any #VC exception. In that case the linked-list logic will not be
> needed.

Can you give me an example, even artificial, in which the linked-list
logic is useful?

>
> > > I don't see how this would break, can you elaborate?
> > >
> > > What I think happens is:
> > >
> > > SYSCALL gap (RSP is from userspace and untrusted)
> > >
> > > -> #VC - Handler on #VC IST stack detects that it interrupted
> > > the SYSCALL gap and switches to the task stack.
> > >
> >
> > Can you point me to exactly what code you're referring to? I spent a
> > while digging through the code and macro tangle and I can't find this.
>
> See the entry code in arch/x86/entry/entry_64.S, macro idtentry_vc. It
> creates the assembly code for the handler. At some point it calls
> vc_switch_off_ist(), which is a C function in arch/x86/kernel/traps.c.
> This function tries to find a new stack for the #VC handler.
>
> The first thing it does is check whether the exception came from the
> SYSCALL gap; if so, it just uses the task stack.
>
> Then it checks whether the exception came from another kernel stack
> that is safe to switch to; the entry and unknown stacks are not
> considered safe. If that check fails it uses the fall-back stack (VC2),
> which directs the handler to a separate function that, for now, just
> calls panic().

Can you explain your reasoning in considering the entry stack unsafe?
It's 4k bytes these days.

You forgot about entry_SYSCALL_compat.

Your 8-byte alignment is confusing to me. In valid kernel code, SP
should be 8-byte-aligned already, and, if you're trying to match
architectural behavior, the CPU aligns to 16 bytes.

We're not robust against the sequence #VC, then NMI in the #VC asm
prologue before the magic stack switch, then a new #VC in the NMI
prologue. Nor do we appear to have any detection of the case where #VC
nests directly inside its own prologue. Or am I missing something here?

If we get NMI and get #VC in the NMI *asm*, the #VC magic stack switch
looks like it will merrily run itself in the NMI special-stack-layout
section, and that sounds really quite bad.
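
Spelling out the nesting I have in mind (hand-drawn, not from any trace):

	#VC                  <- asm prologue running on the #VC IST stack
	  NMI                <- lands before the magic stack switch
	    #VC              <- in the NMI asm; delivered to the same #VC IST
	                        stack top, clobbering the outer #VC's frame,
	                        and its stack switch can then copy pt_regs
	                        into the NMI stack's special layout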

>
> The function then copies pt_regs and returns the new stack pointer to
> assembly code, which then writes it to %RSP.
>
> > Unless AMD is more magic than I realize, the MOV SS bug^Wfeature means
> > that #DB is *not* always called in safe places.
>
> You are right, forgot about this. The MOV SS bug can very well
> trigger a #VC(#DB) exception from the syscall gap.
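
For the record, the kind of user-triggerable sequence I have in mind looks
roughly like this (untested sketch; ss_slot is a stand-in for whatever
memory the attacker pre-loads with the current SS selector and then points
a hardware watchpoint at, e.g. via a ptrace()ing helper):

	#include <sys/syscall.h>

	static unsigned short ss_slot;	/* pre-loaded SS, hw-watchpointed */

	static void db_in_syscall_gap(void)
	{
		long nr = SYS_getpid;

		/*
		 * MOV SS holds the watchpoint's #DB off until after the
		 * next instruction, so it lands on the SYSCALL boundary:
		 * CPL0 with a user RSP, i.e. the SYSCALL gap.  Under
		 * SEV-ES it shows up as #VC(#DB) instead.
		 */
		asm volatile("mov %1, %%ss\n\t"
			     "syscall"
			     : "+a" (nr)
			     : "m" (ss_slot)
			     : "rcx", "r11", "memory");
	}
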
>
> > > And with SNP we need to be able to at least detect a malicious HV so we
> > > can reliably kill the guest. Otherwise the HV could potentially take
> > > control over the guest's execution flow and make it reveal its secrets.
> >
> > True. But is the rest of the machinery needed to be secure against
> > EFLAGS.IF violations and the like in place yet?
>
> Not sure what you mean by EFLAGS.IF violations, probably enabling IRQs
> while in the #VC handler? The #VC handler _must_ _not_ enable IRQs
> anywhere in its call-path. If that ever happens it is a bug.
>

I mean that, IIRC, a malicious hypervisor can inject inappropriate
vectors at inappropriate times if the #HV mechanism isn't enabled.
For example, it could inject a page fault or an interrupt in a context
in which we have the wrong GSBASE loaded.

But the #DB issue makes this moot. We have to use IST unless we turn
off SCE. But I admit I'm leaning toward turning off SCE until we have
a solution that seems convincingly robust.
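
(For concreteness, "turning off SCE" here just means clearing EFER.SCE so
that SYSCALL #UDs and the gap never opens; roughly, and ignoring how
userspace would then enter the kernel:)

	u64 efer;

	rdmsrl(MSR_EFER, efer);
	wrmsrl(MSR_EFER, efer & ~EFER_SCE);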