Re: [PATCH v9 23/42] Documentation/x86: Add CET shadow stack description

From: szabolcs.nagy@xxxxxxx
Date: Thu Jul 06 2023 - 09:08:28 EST


The 07/05/2023 18:45, Edgecombe, Rick P wrote:
> On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@xxxxxxx wrote:
> > Could you spell out what "the issue" is that can be triggered?
> >
> > i meant jumping back from the main to the alt stack:
> >
> > in main:
> > setup sig alt stack
> > setjmp buf1
> >         raise signal on first return
> >         longjmp buf2 on second return
> >
> > in signal handler:
> > setjmp buf2
> >         longjmp buf1 on first return
> >         can continue after second return
> >
> > in my reading of posix this is valid (and works if signals are masked
> > such that the alt stack is not clobbered when jumping away from it).
> >
> > but cannot work with a single shared shadow stack.
>
> Ah, I see. To make this work seamlessly, you would need to have
> automatic alt shadow stacks, and as we previously discussed this is not
> possible with the existing sigaltstack API. (Or at least it seemed like
> a closed discussion to me).
>
> If there is a solution, then we are currently missing a detailed
> proposal. It looks like further down you proposed leaking alt shadow
> stacks (quoted up here near the related discussion):
>
> On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@xxxxxxx wrote:
> > maybe not in glibc, but a libc can internally use alt shadow stack
> > in sigaltstack instead of exposing a separate sigaltshadowstack api.
> > (this is what a strict posix conform implementation has to do to
> > support shadow stacks), leaking shadow stacks is not a correctness
> > issue unless it prevents the program working (the shadow stack for
> > the main thread likely wastes more memory than all the alt stack
> > leaks. if the leaks become dominant in a thread the sigaltstack
> > libc api can just fail).
>
> It seems like your priority must be to make sure pure C apps don't have
> to make any changes in order to not crash with shadow stack enabled.
> And this at the expense of any performance and memory usage. Do you
> have some formalized priorities or design philosophy you can share?
>
> Earlier you suggested glibc should create new interfaces to handle
> makecontext() (makes sense). Shouldn't the same thing happen here? In
> which case we are in code-changes territory and we should ask ourselves
> what apps really need.

instead of priority, i'd say "posix conform c apps work
without change" is a benchmark i use to see if the design
is sound.

i do not have a particular workload (or distro) in mind, so
i have to reason through the cases that make sense and the
current linux syscall abi allows, but fail or difficult to
support with shadow stacks.

one such case is jumping back to an alt stack (i.e. inactive
live alt stack):

- with shared shadow stack this does not work in many cases.

- with alt shadow stack this extends the lifetime beyond the
point it become inactive (so it cannot be freed).

if there are no inactive live alt stacks then *both* shared
and implicit alt shadow stack works. and to me it looked
like implicit alt shadow stack is simply better of those two
(more alt shadow stack use-cases are supported, shadow stack
overflow can be handled. drawback: complications due to the
discontinous shadow stack.)

on arm64 i personally don't like the idea of "deal with alt
shadow stack later" because it likely requires a v2 abi
affecting the unwinder and jump implementations. (later
extensions are fine if they are bw compat and discoverable)

one nasty case is shadow stack overflow handling, but i
think i have a solution for that (not the nicest thing:
it involves setting the top bit on the last entry on the
shadow stack instead of adding a new entry to it. + a new
syscall that can switch to this entry. i havent convinced
myself about this yet).

>
> >
> > > >   we
> > > >   can ignore that corner case and adjust the model so the shared
> > > >   shadow stack works for alt stack, but it likely does not change
> > > > the
> > > >   jump design: eventually we want alt shadow stack.)
> > >
> > > As we discussed previously, alt shadow stack can't work
> > > transparently
> > > with existing code due to the sigaltstack API. I wonder if maybe
> > > you
> > > are trying to get at something else, and I'm not following.
> >
> > i would like a jump design that works with alt shadow stack.
>
> A shadow stack switch could happen based on the following scenarios:
> 1. Alt shadow stack
> 2. ucontext
> 3. custom stack switching logic
>
> If we leave a token on signal, then 1 and 2 could be guaranteed to have
> a token *somewhere* above where setjmp() could have been called.
>
> The algorithm could be to search from the target SSP up the stack until
> it finds a token, and then switch to it and INCSSP back to the SSP of
> the setjmp() point. This is what we are talking about, right?
>
> And the two problems are:
> - Alt shadow stack overflow problem
> - In the case of (3) there might not be a token
>
> Let's ignore these problems for a second - now we have a solution that
> allows you to longjmp() back from an alt stack or ucontext stack. Or at
> least it works functionally. But is it going to actually work for
> people who are using longjmp() for things that are supposed to be fast?

slow longjmp is bad. (well longjmp is actually always slow
in glibc because it sets the signalmask with a syscall, but
there are other jump operations that don't do this and want
to be fast so yes we want fast jump to be possible).

jumping up the shadow stack is at least linear time in the
number of frames jumped over (which already sounds significant
slowdown however this is amortized by the fact that the stack
frames had to be created at some point and that is actually a
lot more expensive because it involves write operations, so a
zero cost jump will not do any asymptotic speedup compared to
a linear cost jump as far as i can see.).

with my proposed solution the jump is still linear. (i know
x86 incssp can jump many entries at a time and does not have
to actually read and check the entries, but technically it's
linear time too: you have to do at least one read per page to
have the guardpage protection). this all looks fine to me
even for extreme made up workloads.

> Like, is this the tradeoff people want? I see some references to fiber
> switching implementations using longjmp(). I wonder if the existing
> INCSSP loops are not going to be ideal for every usage already, and
> this sounds like going further down that road.
>
> For jumping out occasionally in some error case, it seems it would be
> useful. But I think we are then talking about targeting a subset of
> people using these stack switching patterns.
>

i simply don't see any trade-off here (i expect no measurable
difference with a scanning vs no scanning approach even in
a microbenchmark that does longjmp in a loop independently of
the stack switch pattern and even if the non-scanning
implementation can do wrss).

> Looking at the docs Mark linked (thanks!), ARM has generic GCS PUSH and
> POP shadow stack instructions? Can ARM just push a restore token at
> setjmp time, like I was trying to figure out earlier with a push token
> arch_prctl? It would be good to understand how ARM is going to
> implement this with these differences in what is allowed by the HW.
>
> If there are differences in how locked down/functional the hardware
> implementations are, and if we want to have some unified set of rules
> for apps, there will need to some give and take. The x86 approach was
> mostly to not support all behaviors and ask apps to either change or
> not enable shadow stacks. We don't want one architecture to have to do
> a bunch of strange things, but we also don't want one to lose some key
> end user value.
>
> I'm thinking that for pure tracing users, glibc might do things a lot
> differently (use of WRSS to speed things up). So I'm guessing we will
> end up with at least one more "policy" on the x86 side.
>
> I wonder if maybe we should have something like a "max compatibility"
> policy/mode where arm/x86/riscv could all behave the same from the
> glibc caller perspective. We could add kernel help to achieve this for
> any implementation that is more locked down. And maybe that is x86's v2
> ABI. I don't know, just sort of thinking out loud at this point. And
> this sort of gets back to the point I keep making: if we need to decide
> tradeoffs, it would be great to get some users to start using this and
> start telling us what they want. Are people caring mostly about
> security, compatibility or performance?
>
> [snip]
>