Re: [PATCH v3 00/21] Enable CET Virtualization

From: Andrew Cooper
Date: Thu Jul 20 2023 - 06:46:53 EST


On 20/07/2023 9:03 am, Peter Zijlstra wrote:
> On Thu, Jul 20, 2023 at 07:26:04AM +0200, Pankaj Gupta wrote:
>>>> My understanding is that PL[0-2]_SSP are used only on transitions to the
>>>> corresponding privilege level from a *different* privilege level. That means
>>>> KVM should be able to utilize the user_return_msr framework to load the host
>>>> values. Though if Linux ever supports SSS, I'm guessing the core kernel will
>>>> have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
>>>> userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
>>>> per-task, on every context switch.
>>>>
>>>> But note my original wording: **If that's necessary**
>>>>
>>>> If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
>>>> IA32_S_CET, then running host stuff with guest values should be ok. KVM only
>>>> needs to guarantee that it doesn't leak values between guests. But that should
>>>> Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
>>>> guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
>>>>
>>>> And regardless of what the mechanism ends up managing SSP MSRs, it should only
>>>> ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
>>>> never consume PL{1,2}_SSP.
>>> To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.
>> Trying to understand more about what prevents enabling SSS pre-FRED. Is
>> it better #CP exception handling with other nested exceptions?
> SSS

Careful with using SSS to mean "supervisor shadow stacks", because there's a
brand new CET_SSS CPUID bit to cover the (mis)feature where shstk
supervisor tokens can be *prematurely busy*.

(11/10 masterful wordsmithing, because it does lull you into the
impression that this isn't WTF^2 levels of crazy)

> took the syscall gap and made it worse -- as in *way* worse.

More impressively, it created a sysenter gap where there wasn't one
previously.

> To top it off, the whole SSS busy bit thing is fundamentally
> incompatible with how we manage to survive nested exceptions in NMI
> context.

To be clear, this is supervisor shadow stack regular busy bits, not the
CET_SSS premature busy problem.
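
For anyone following along, here is a rough sketch of what a regular busy
bit does. It is a toy model, not kernel code: the struct and function names
are hypothetical, and the token layout (the token holds its own linear
address, with bit 0 as the busy flag) is my reading of the architecture
rather than something stated in this thread. Real hardware faults where the
model returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SSS_TOKEN_BUSY 0x1ull	/* assumed: bit 0 of the token is "busy" */

/* Hypothetical model of one supervisor shadow stack: only its token slot. */
struct sss_model {
	uint64_t token_addr;	/* linear address where the token lives */
	uint64_t token;		/* value currently stored at token_addr */
};

static void sss_init(struct sss_model *s, uint64_t addr)
{
	assert((addr & 0x7) == 0);	/* tokens are 8-byte aligned */
	s->token_addr = addr;
	s->token = addr;		/* valid and not busy */
}

/* Switching onto the stack: only allowed if the token is valid and free. */
static bool sss_acquire(struct sss_model *s)
{
	if (s->token != s->token_addr)	/* already busy, or corrupt */
		return false;		/* hardware would fault here */
	s->token |= SSS_TOKEN_BUSY;
	return true;
}

/* Leaving the stack: only allowed if the token is valid and busy. */
static bool sss_release(struct sss_model *s)
{
	if (s->token != (s->token_addr | SSS_TOKEN_BUSY))
		return false;
	s->token = s->token_addr;
	return true;
}
```

In this model, a second sss_acquire() on the same stack before the matching
sss_release() fails. That is the shape of the problem with a nested
exception landing on a shadow stack which is already in use.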

>
> Basically, the whole x86 exception / stack switching logic was already
> borderline impossible (consider taking an MCE in the early NMI path
> where we set up, but have not finished, the re-entrancy stuff), and
> SSS pushed it over the edge and set it on fire.
>
> And NMI isn't the only problem, the various new virt exceptions #VC and
> #HV are on their own already near impossible, adding SSS again pushes
> the whole thing into clear insanity.
>
> There's a good exposition of the whole trainwreck by Andrew here:
>
> https://www.youtube.com/watch?v=qcORS8CN0ow
>
> (that is, sorry for the youtube link, but Google is failing me in
> finding the actual Google Doc that talk is based on, or even the slide
> deck :/)

https://docs.google.com/presentation/d/10vWC02kpy4QneI43qsT3worfF_e3sbAE3Ifr61Sq3dY/edit?usp=sharing
is the slide deck.

I'm very glad I put an "only accurate as of $PRESENTATION_DATE"
disclaimer on slide 14. It means the whole presentation is still
technically correct.

FRED is now at draft 5, and importantly shstk tokens have been removed.
They've been replaced with an alternative MSR-based mechanism, mostly for
performance reasons, but a consequence is that the prematurely busy bug
can't happen.

~Andrew