Re: kernel %rsp code at sysenter PTI vs no-PTI

From: Andy Lutomirski
Date: Sat Jul 21 2018 - 20:03:01 EST


On Thu, Jul 5, 2018 at 10:14 AM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> The PTI path does this:
>
> ...
> SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
> /* Load the top of the task stack into RSP */
> movq CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
>
> And the non-PTI entry path does this:
>
> ...
> movq %rsp, PER_CPU_VAR(rsp_scratch)
> movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>
> Both "mov ___, %rsp" instructions have the kernel %GS value in place and
> both are running on a good kernel CR3. Does anybody remember why we
> don't use cpu_current_top_of_stack in the PTI-on case?
>
> I'm wondering if it was because we, at some point, did the mov ...,
> %rsp before CR3 was good. But it doesn't look like we do that now, so
> should we maybe make both copies do:
>
> movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

Speed, sort of. Without the CR3 switch there (i.e. PTI off but the
trampoline still in use, which is the path that actually gets used),
nothing forces serialization between swapgs and that movq, since it is
the CR3 write that serializes. And it turns out that the RIP-relative
load avoids the pipeline stall that a %gs-relative access right after
swapgs would cause. So, with all the mitigations off, the trampoline
ends up being *faster* than the non-trampolined path, at least in a
tight loop.

Of course, on a retpolined kernel, the retpoline at the end of the
trampoline kills performance.

>
> for consistency?