Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch

From: Andy Lutomirski
Date: Fri Jun 15 2018 - 14:32:14 EST


On Fri, Jun 15, 2018 at 6:11 AM Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
>
> Hi Andy & folks,
>
> Lots of crypto routines look like this:
>
> kernel_fpu_begin();
> encrypt();
> kernel_fpu_end();
>
> If you call such a routine twice, you get:
>
> kernel_fpu_begin();
> encrypt();
> kernel_fpu_end();
> kernel_fpu_begin();
> encrypt();
> kernel_fpu_end();
>
> In a loop this looks like:
>
> for (thing) {
> kernel_fpu_begin();
> encrypt(thing);
> kernel_fpu_end();
> }
>
> This is obviously very bad, because begin() and end() are slow, so
> WireGuard does the obvious:
>
> kernel_fpu_begin();
> for (thing)
> encrypt(thing);
> kernel_fpu_end();
>
> This is fine and well, and the crypto API I'm working on will enable
> this to be done in a clear way, but I do wonder if maybe this is not
> something that should be happening at the level of the caller, but
> rather in the fpu functions themselves. Namely, what are your thoughts
> on modifying kernel_fpu_end() so that it doesn't actually restore the
> state, but just reenables preemption and marks that on the next
> context switch, the state should be restored? Then, essentially,
> kernel_fpu_begin() and end() become free after the first usage of
> kernel_fpu_begin().
>
> Is this something feasible? I know that performance-wise, I'm really
> gaining a lot from hoisting those functions out of the loops, and API
> wise, it'd be slightly simpler to implement if I didn't have to all
> for that hoisting.

Hi Jason-

Funny you asked. This has been discussed a few times, although not
quite in the form you imagined. The idea that we've tossed around is
to restore FPU state on return to user mode. Roughly, we'd introduce
a new thread flag TIF_FPU_UNLOADED (name TBD).
prepare_exit_to_usermode() would notice this flag, copy the fpstate to
fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
one has quite thought it through, but I think it should be outside the
loop.) We'd update all the FPU accessors to understand the flag.
We'd probably have a helper like:

void unload_user_fpstate(void)

that would do nothing if TIF_FPU_UNLOADED is set. If TIF_FPU_UNLOADED
is clear, it would move the state to memory and set TIF_FPU_UNLOADED.
We'd call unload_user_fpstate() during context switch in the prev task
context and in kernel_fpu_begin().

The one major complication I know of is that PKRU is different from
all other FPU state in that it needs to be loaded when *kernel* code
does any user memory access. So we'd need to make sure that context
switches load the new task's PKRU eagerly. Using WRPKRU is easy, but,
unless we do something very clever, actually finding PKRU in the
in-memory fpstate image may be slightly nontrivial.

I suppose we could eagerly load the new FPU state on context switches,
but I'm not sure I see any point. We'd still have to load it when we
switch back to user mode, so it would be slower but not necessarily
any simpler.

I'd love to review patches to make this change, but I don't have the
bandwidth to write them right now.