Re: [PATCH] x86/entry/64: randomize kernel stack offset upon syscall

From: Ingo Molnar
Date: Mon Apr 15 2019 - 03:25:43 EST



* Elena Reshetova <elena.reshetova@xxxxxxxxx> wrote:

> This is an example of produced assembly code for gcc x86_64:
>
> ...
> add_random_stack_offset();
> 0xffffffff810022e9 callq 0xffffffff81459570 <prandom_u32>
> 0xffffffff810022ee movzbl %al,%eax
> 0xffffffff810022f1 add $0x16,%rax
> 0xffffffff810022f5 and $0x1f8,%eax
> 0xffffffff810022fa sub %rax,%rsp
> 0xffffffff810022fd lea 0xf(%rsp),%rax
> 0xffffffff81002302 and $0xfffffffffffffff0,%rax
> ...

Hm, that call to prandom_u32() then does:

ffffffff8146cd70 <prandom_u32>:
ffffffff8146cd70: 48 c7 c7 10 e3 01 00 mov $0x1e310,%rdi
ffffffff8146cd77: 65 48 03 3d d9 23 ba add %gs:0x7eba23d9(%rip),%rdi # f158 <this_cpu_off>
ffffffff8146cd7e: 7e
ffffffff8146cd7f: e8 6c ff ff ff callq ffffffff8146ccf0 <prandom_u32_state>
ffffffff8146cd84: c3 retq

which then does:

ffffffff8146ccf0 <prandom_u32_state>:
ffffffff8146ccf0: 8b 0f mov (%rdi),%ecx
ffffffff8146ccf2: 8b 47 04 mov 0x4(%rdi),%eax
ffffffff8146ccf5: 44 8b 47 08 mov 0x8(%rdi),%r8d
ffffffff8146ccf9: 89 ca mov %ecx,%edx
ffffffff8146ccfb: 8d 34 85 00 00 00 00 lea 0x0(,%rax,4),%esi
ffffffff8146cd02: c1 e2 06 shl $0x6,%edx
ffffffff8146cd05: 31 f0 xor %esi,%eax
ffffffff8146cd07: 83 e6 e0 and $0xffffffe0,%esi
ffffffff8146cd0a: 31 ca xor %ecx,%edx
ffffffff8146cd0c: c1 e1 12 shl $0x12,%ecx
ffffffff8146cd0f: 81 e1 00 00 f8 ff and $0xfff80000,%ecx
ffffffff8146cd15: c1 ea 0d shr $0xd,%edx
ffffffff8146cd18: 31 ca xor %ecx,%edx
ffffffff8146cd1a: 89 c1 mov %eax,%ecx
ffffffff8146cd1c: 44 89 c0 mov %r8d,%eax
ffffffff8146cd1f: c1 e0 0d shl $0xd,%eax
ffffffff8146cd22: c1 e9 1b shr $0x1b,%ecx
ffffffff8146cd25: 89 17 mov %edx,(%rdi)
ffffffff8146cd27: 44 31 c0 xor %r8d,%eax
ffffffff8146cd2a: 41 c1 e0 07 shl $0x7,%r8d
ffffffff8146cd2e: 31 f1 xor %esi,%ecx
ffffffff8146cd30: c1 e8 15 shr $0x15,%eax
ffffffff8146cd33: 41 81 e0 00 f8 ff ff and $0xfffff800,%r8d
ffffffff8146cd3a: 89 4f 04 mov %ecx,0x4(%rdi)
ffffffff8146cd3d: 41 31 c0 xor %eax,%r8d
ffffffff8146cd40: 8b 47 0c mov 0xc(%rdi),%eax
ffffffff8146cd43: 44 89 47 08 mov %r8d,0x8(%rdi)
ffffffff8146cd47: 8d 34 c5 00 00 00 00 lea 0x0(,%rax,8),%esi
ffffffff8146cd4e: 31 c6 xor %eax,%esi
ffffffff8146cd50: c1 e0 0d shl $0xd,%eax
ffffffff8146cd53: 25 00 00 f0 ff and $0xfff00000,%eax
ffffffff8146cd58: c1 ee 0c shr $0xc,%esi
ffffffff8146cd5b: 31 c6 xor %eax,%esi
ffffffff8146cd5d: 89 d0 mov %edx,%eax
ffffffff8146cd5f: 31 c8 xor %ecx,%eax
ffffffff8146cd61: 89 77 0c mov %esi,0xc(%rdi)
ffffffff8146cd64: 44 31 c0 xor %r8d,%eax
ffffffff8146cd67: 31 f0 xor %esi,%eax
ffffffff8146cd69: c3 retq
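
In C that whole thing is essentially (paraphrasing lib/random32.c from
memory - double-check the exact constants there):

/* The per-cpu wrapper - this is where the %gs-relative access comes from: */
u32 prandom_u32(void)
{
	struct rnd_state *state = &get_cpu_var(net_rand_state);
	u32 res = prandom_u32_state(state);

	put_cpu_var(net_rand_state);
	return res;
}

/* The four-word Tausworthe step itself: four loads, four stores and a
 * pile of shift/mask/xor work - all of it per syscall with this patch: */
#define TAUSWORTHE(s, a, b, c, d) (((s & c) << d) ^ (((s << a) ^ s) >> b))

u32 prandom_u32_state(struct rnd_state *state)
{
	state->s1 = TAUSWORTHE(state->s1,  6U, 13U, 4294967294U, 18U);
	state->s2 = TAUSWORTHE(state->s2,  2U, 27U, 4294967288U,  2U);
	state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U,  7U);
	state->s4 = TAUSWORTHE(state->s4,  3U, 12U, 4294967168U, 13U);

	return state->s1 ^ state->s2 ^ state->s3 ^ state->s4;
}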

and just to recover ~5 bits of randomness:

> As a result of the above gcc-produced code this patch introduces
> a bit more than 5 bits of randomness after pt_regs location on
> the thread stack (33 different offsets are generated
> randomly for x86_64 and 47 for i386).
> The amount of randomness can be adjusted based on how much of the
> stack space we wish/can trade for security.

Note that the reduction in bits is mostly due to:

> +#define add_random_stack_offset() do { \
> + size_t offset = ((size_t)prandom_u32()) % 256; \
> + char *ptr = __builtin_alloca(offset); \
> + asm volatile("":"=m"(*ptr)); \

So if we are really sticking this into the syscall path, could we please
be a bit smarter about it? Note how we are already masking off bits in
prandom_u32_state():

ffffffff8146cd53: 25 00 00 f0 ff and $0xfff00000,%eax

and then we mask *again*, in a rather stupid way:

> 0xffffffff810022ee movzbl %al,%eax
> 0xffffffff810022f1 add $0x16,%rax
> 0xffffffff810022f5 and $0x1f8,%eax
> 0xffffffff810022fa sub %rax,%rsp
> 0xffffffff810022fd lea 0xf(%rsp),%rax
> 0xffffffff81002302 and $0xfffffffffffffff0,%rax
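
Purely as an illustration (the 0x3f0 mask is a made-up example value, not
taken from the patch), handing alloca() an already-masked, 16-byte granular
size would at least express the intent in a single masking step - whether
GCC then actually drops its own fixup arithmetic is something that would
have to be verified in the generated code:

/*
 * Illustrative sketch only: pre-mask the PRNG output to the range and
 * granularity we actually want before alloca() sees it.  The 0x3f0
 * mask (6 bits of randomness in 16-byte steps) is an invented example.
 */
#define add_random_stack_offset() do {				\
	size_t offset = prandom_u32() & 0x3f0;			\
	char *ptr = __builtin_alloca(offset);			\
	asm volatile("" : "=m" (*ptr));				\
} while (0)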

What I'd suggest is:

1)

Put the PRNG state not into a per-cpu area which has to be recovered, but
put it somewhere convenient that is already easily accessible at that
point in the syscall path - such as struct thread_info? (But don't take my
word for it - if task_struct is a better place then use that.)

This avoids the whole percpu redirection.
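
Roughly along these lines (the field name is invented, and whether
thread_info has room for a full struct rnd_state is part of what needs
checking):

/* Sketch only - the field name is invented, and thread_info vs.
 * task_struct placement is exactly the open question above: */
struct thread_info {
	unsigned long		flags;
	/* ... existing fields ... */
	struct rnd_state	stack_rnd;	/* per-thread PRNG state */
};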

2)

Then also please make a private version of that function which is inlined
into the straight execution path: this code is going to be executed for
every single syscall, so inlining is 1000% justified.
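
I.e. something in the direction of (again only a sketch, building on the
invented field from 1) above):

/* Sketch: a private, always-inlined helper operating on the per-thread
 * state - no function call, no per-cpu indirection on the syscall path: */
static __always_inline u32 stack_offset_prandom(void)
{
	return prandom_u32_state(&current_thread_info()->stack_rnd);
}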

3)

Have a really good look at the generated assembly code, justify every
single instruction, and try to eliminate every one that is not strictly
needed. Maybe even take a look at
whether there's any simpler PRNG that is better suited: if we have per
thread_info random state then there will be no cross-task leakage of the
"secret" and we can actually weaken the PRNG.

4)

But before you tweak the patch, a more fundamental question:

Does the stack offset have to be per *syscall execution* randomized?
Which threats does this protect against that a simpler per task syscall
random offset wouldn't protect against?

It would be far, far more performant to just add a fork()-time randomized
kernel stack offset (and that can be a *real* random offset, not a PRNG)
to syscalls. We could even make that an overall default setting if it's
fast enough, which I think it could be.
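
I.e. something like this at fork() time (the task_struct field name is
invented), after which the syscall path only has to apply a constant
per-task offset:

/* Sketch of the fork()-time variant - the field name is invented.
 * fork() is not a fast path, so real entropy is affordable here: */
static inline void randomize_task_stack_offset(struct task_struct *tsk)
{
	tsk->stack_rnd_offset = get_random_u32() & 0x1f0;	/* example mask */
}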

> Performance (x86_64 measurements only):
>
> 1) lmbench: ./lat_syscall -N 1000000 null
> base: Simple syscall: 0.1774 microseconds
> random_offset (prandom_u32() every syscall): Simple syscall: 0.1822 microseconds
>
> 2) Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> base: 10000000 loops in 1.62224s = 162.22 nsec / loop
> random_offset (prandom_u32() every syscall): 10000000 loops in 1.64660s = 164.66 nsec / loop

Was this measured with CONFIG_PAGE_TABLE_ISOLATION turned off? By the
time such patches get to real Linux users we'll likely see modern
processors with Meltdown fixed.

I.e. the relative cost of syscall entry code changes should generally be
measured with CONFIG_PAGE_TABLE_ISOLATION off.

Thanks,

Ingo