Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

From: Andy Lutomirski
Date: Fri Dec 18 2015 - 11:04:36 EST


On Thu, Dec 17, 2015 at 10:43 PM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
> On December 17, 2015 9:29:21 PM PST, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>On Dec 17, 2015 6:53 PM, "Dave Hansen" <dave.hansen@xxxxxxxxxxxxxxx>
>>wrote:
>>>
>>> On 12/17/2015 06:32 PM, Andy Lutomirski wrote:
>>> > On Thu, Dec 17, 2015 at 6:13 PM, Dave Hansen
>><dave.hansen@xxxxxxxxxxxxxxx> wrote:
>>> >> But what about the register state when delivering a signal? Don't
>>we
>>> >> set the registers to the init state? Do we need to preserve PKRU
>>state
>>> >> instead of init'ing it? The init state _is_ nice here because it
>>is
>>> >> permissive allows us to do something useful no matter what PKRU
>>gets set to.
>>> >
>>> > I think we leave the extended regs alone. Don't we?
>>> >
>>> > I think that, for most signals, we want to leave PKRU as is,
>>> > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we
>>want
>>> > an option to reset PKRU on delivery (and then set the flag to
>>restore
>>> > on return?).
>>>
>>> Is there some precedent for doing the state differently for different
>>> signals?
>>
>>Yes, to a very limited extent: SA_ONSTACK.
>>
>>>
>>> >> Well, the signal handler isn't necessarily going to clobber it,
>>but the
>>> >> delivery code already clobbers it when going to the init state.
>>> >
>>> > Can you point to that code?
>>>
>>> handle_signal() -> fpu__clear()
>>>
>>> The comment around it says:
>>>
>>> "Ensure the signal handler starts with the new fpu state."
>>>
>>
>>You win this round :)
>>
>>So maybe we should have a mask of xfeatures that aren't cleared on
>>signal delivery (e.g. PKRU, perhaps) and that are, by default,
>>preserved across signal returns.
>>

[...]

>
> I find the notion of PKRU not being initialized at the entry to a signal handler a bit odd. Saving/restoring it seems like the right thing to me.
>
> What am I missing?

On Fri, Dec 18, 2015 at 12:32 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> So the principle is this: signals are conceptually like creating a new thread of
> execution, and that's a very powerful programming concept, like fork() or
> pthread_create() are powerful concepts. So we definitely want to keep that default
> behavior, and I don't think we want to deviate from that for typical new extended
> CPU context features, even if signal delivery slows down as a result.

I think that PKRU is in almost no respect a typical no extended CPU
context feature. It's a control register, not a data register.

Let's try a concrete example. One big use case for PKRU will be in
conjunction with DAX-style pmem. Imagine some program (database
server, perhaps) that mmaps 1 TB of pmem MAP_SHARED, PROT_READ |
PROT_WRITE with protection key 1. It sets PKRU so that key 1 can't be
written. All of its accesses to the pmem store are in tight code
regions that do wrpkru; mov; wrpkru (just like what the kernel does
with stac/clac on SMAP systems).

>From the app's perspective, it's imperative that stray writes that
coincidentally hit the pmem region fail. (1 TB is big enough that a
random write to canonical user memory hits it almost 1% of the time.)
This means:

1. If a signal is delivered, the app wants PKRU to be set to some safe
value as early as possible. The app can do it by itself if it doesn't
use libraries that interfere, but it would be IMO much, much nicer if
the kernel made this happen automatically.

1b. If the app malfunctions such that RSP points to pmem, the kernel
MUST NOT clobber the pmem space. I think that this basically mandates
that PKRU needs to have some safe state (i.e. definitely not the init
state) on signal delivery: the kernel is going to write a signal frame
at the address identified by RSP, and that address is in pmem, so
those writes need to fail.

2. clone with shared VM should preserve PKRU.

So in this regard, I think signals are kind of like new threads after all.

Now that I think about it more, there are at least two possibilities
that work for this use case.

Option A: signal delivery preserves PKRU. If we go this route, I
think we should decide whether PKRU is preserved on exit. Given that
we're talking about adding syscalls to adjust PKRU, it seems a bit odd
to me that sigreturn would, by default, undo whatever those syscalls
do.

Option B: signal delivery resets PKRU to user-specified values. We're
talking about having syscalls to write PKRU. We could have clone and
signal entry use the values set by syscall instead of the values in
the actual PKRU register. That way:

set_protection_key(1, no writes);
...
wrpkru(allow writes to key 1)
mov something <<<<<<<<<<< fault or async signal here
wrpkru(disallow writes to key 1)

leaves key 1 protected in the signal handler.

If we go that route, I think we must restore PKRU just like any other
xstate on sigreturn, so in some sense it's the simplest of all to
implement. We'll need to change the signal delivery stuff to do the
fpu__clear and the automatic PKRU write before trying to set up the
signal frame so that the frame is written with the syscall-specified
protections instead of the wrpkru-overridden values, but that should
be easy.


Which reminds me: __get_user, etc all respect PKRU because the SDM
says that PKRU affects all user *and* kernel accesses to user memory
(i.e. anything with _PAGE_USER). Should get_user_pages also respect
PKRU?

>
> But we've been arguing about 'lightweight signals' for up to two decades that I
> can remember. (The first such suggestion was to not save the FPU state, back when
> FPU saves were ridiculously slow compared to other parts of saving/restoring a
> context.)

I'm not making any claims about lightweightness here. While
performance is an issue in some cases, I'd like to hash out the
baseline functionality first.

>
> So having a well-enumerated, extensible opt-in mask (which defaults to 'all state
> saved') that allows smart signal handlers to skip the save/restore of certain CPU
> context components would be acceptable I think.

I have a patch that helps considerably with a probably-unnoticeable
ABI change by letting signal delivery use sysret. I'll dust it off.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/