RE: [PATCH] x86/entry/64: randomize kernel stack offset upon syscall

From: Reshetova, Elena
Date: Tue May 28 2019 - 08:32:20 EST


> > With 5 bits there's a ~96.9% chance of crashing the system in an attempt,
> > the exploit cannot be used for a range of attacks, including spear
> > attacks and fast-spreading worms, right? A crashed and inaccessible
> > system also increases the odds of leaving around unfinished attack code
> > and leaking a zero-day attack.
>
> Yup, which is why I'd like to have _something_ here without us getting
> lost in the "perfect entropy" weeds. :)

I am starting to believe that we cannot make good randomness sources fast
enough for per-syscall use if our target is 1-2% overhead under the worst possible
(and potentially unrealistic) scenario: a stress test of some simple syscall like getpid().
The only thing that would fit within that margin is indeed rdtsc().

I profiled the path that uses get_random_bytes(), and the results look like this
(arch_get_random_long is not inlined here, for measurement purposes):

> >
> > | | | --9.44%--random_get_byte
> > | | | |
> > | | | --8.08%--get_random_bytes
> > | | | |
> > | | | --7.80%--_extract_crng.constprop.45
> > | | | |
> > | | | |--4.95%--arch_get_random_long
> > | | | |
> > | | | --2.39%--chacha_block


And here is the proof that under such usage _extract_crng bottlenecks on rdrand:

PerfTop: 5877 irqs/sec kernel:78.6% exact: 100.0% [4000Hz cycles:ppp], (all, 8 CPUs)
------------------------------------------------------------------------------------------------------------------------------------
Showing cycles:ppp for _extract_crng.constprop.46
Events Pcnt (>=5%)
Percent | Source code & Disassembly of kcore for cycles:ppp (2104 samples, percent: local period)
-------------------------------------------------------------------------------------------------------
0.00 : ffffffff9abd1a62: mov $0xa,%edx
97.94 : ffffffff9abd1a67: rdrand %rax

And then, of course, there is the cost of the chacha permutation itself. So I think Andy's
proposal to rewrite get_random_bytes() for speed is not so easy to implement.

So, given that all we want is to raise the bar for attackers trying to predict the stack
location on a subsequent syscall, is it really worth trying to come up with more complex
solutions than just using the lower bits of rdtsc() by default?
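
As a rough illustration of the rdtsc() option, the sketch below maps the low bits of a
timestamp value to a stack offset. It is not the code from the patch: the function name,
the constants and the 16-byte step are assumptions made for this example, and the
timestamp is passed in as a parameter (in the kernel it would come from rdtsc()) so that
the mapping can be exercised in user space.

```c
#include <stdint.h>

/* 5 bits of randomness per syscall, as discussed in this thread. */
#define KSTACK_OFFSET_BITS  5u
/* Assumption for this sketch: move the stack in 16-byte steps. */
#define KSTACK_OFFSET_STEP  16u

/*
 * Map a raw timestamp-counter value to a stack offset in bytes.
 * Hypothetical helper: only the low KSTACK_OFFSET_BITS bits are used,
 * since those change fastest between two syscalls.
 */
static inline uint32_t kstack_offset_from_tsc(uint64_t tsc)
{
	uint32_t slot = (uint32_t)(tsc & ((1u << KSTACK_OFFSET_BITS) - 1u));

	return slot * KSTACK_OFFSET_STEP;
}
```

With these assumed values the offset takes one of 32 values between 0 and 496 bytes.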

One idea that was suggested to me last week is to create a pool of good randomness and
then, during a syscall, select a random number from the pool using something like
rdtsc() % POOL_SIZE. The pool would need to be refilled periodically, outside of the
syscall path, to maintain diversity. I can try this approach if people believe that it
would address the security concerns around rdtsc(). My personal feeling is that one can
still mount a timing attack on this if we assume that rdtsc() can be attacked, and the
complexity of the whole thing increases considerably.
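
A minimal sketch of the pool idea, under stated assumptions: the structure, the function
names and the pool size of 64 are hypothetical, the fresh bytes are handed in by the
caller (in the kernel they could come from get_random_bytes(), outside the syscall path),
and the timestamp is again a plain parameter standing in for rdtsc().

```c
#include <stdint.h>
#include <string.h>

#define POOL_SIZE 64u	/* hypothetical size, not from the thread */

struct offset_pool {
	uint8_t entries[POOL_SIZE];
};

/*
 * Slow path: replace the pool contents with fresh random bytes.
 * Meant to run periodically, outside of the syscall path.
 */
static inline void offset_pool_refill(struct offset_pool *pool,
				      const uint8_t fresh[POOL_SIZE])
{
	memcpy(pool->entries, fresh, POOL_SIZE);
}

/*
 * Syscall path: the timestamp only selects WHICH entry is used;
 * the 5-bit value itself comes from the pre-filled pool.
 */
static inline uint8_t offset_pool_pick(const struct offset_pool *pool,
				       uint64_t tsc)
{
	return pool->entries[tsc % POOL_SIZE] & 0x1f;
}
```

Note that an attacker who can predict the timestamp can still predict the index, which is
the concern raised above; the pool only hides the value stored at that index.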

If we decide that this is too much trouble for the mere 5 bits of randomness we need per
syscall, I would still propose that we reconsider the original rdtsc() approach, since it
is still better than nothing. We can have the whole thing on three levels:

1) CONFIG_RANDOMIZE_KSTACK_OFFSET off: no randomization, like now.
2) CONFIG_RANDOMIZE_KSTACK_OFFSET on with rdtsc(): fast, better than nothing, but prone to
   timing attacks.
3) CONFIG_RANDOMIZE_KSTACK_OFFSET on with get_random_bytes(): better security guarantees.
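
For the get_random_bytes() level, the per-call cost shown in the profile can be amortized
by filling a buffer once and handing out one byte per syscall. The sketch below shows that
batching pattern only; it is not the measured code. All names are hypothetical, and
demo_fill() is a deterministic stand-in for get_random_bytes() that is NOT random.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RND_BUF_SIZE 4096u	/* buffer size used in the measurements */

typedef void (*rnd_fill_fn)(void *buf, size_t len);

struct rnd_batch {
	uint8_t buf[RND_BUF_SIZE];
	size_t pos;		/* next unread byte; RND_BUF_SIZE means empty */
};

/* Counts refills so the amortization can be observed. */
static unsigned int demo_fill_calls;

/* Stand-in for get_random_bytes(): deterministic, for illustration only. */
static inline void demo_fill(void *buf, size_t len)
{
	demo_fill_calls++;
	memset(buf, (int)demo_fill_calls, len);
}

/* Start with an empty buffer so the first read triggers a refill. */
static inline void rnd_batch_init(struct rnd_batch *b)
{
	b->pos = RND_BUF_SIZE;
}

/* Return one byte; pay for the expensive refill once per RND_BUF_SIZE reads. */
static inline uint8_t rnd_batch_get_byte(struct rnd_batch *b, rnd_fill_fn fill)
{
	if (b->pos >= RND_BUF_SIZE) {
		fill(b->buf, RND_BUF_SIZE);
		b->pos = 0;
	}
	return b->buf[b->pos++];
}
```

In a kernel setting such a buffer would presumably be per-CPU to avoid locking; that
detail is left out of the sketch.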

The performance numbers for these look approximately like this:

No randomization:                    Simple syscall: 0.0534 microseconds
With rdtsc():                        Simple syscall: 0.0539 microseconds
With get_random_bytes(4096 buffer):  Simple syscall: 0.0597 microseconds

The pure rdrand option, calling rdrand_long every 10th syscall, is considerably slower:

With rdrand (every 10th syscall):    Simple syscall: 0.0719 microseconds
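
To relate these timings to the 1-2% overhead target, the relative overhead against the
unrandomized baseline can be computed as below. The helper names are made up for this
example; the inputs are the figures quoted above. The results come out at roughly 0.9%
for rdtsc(), 11.8% for get_random_bytes() and 34.6% for rdrand every 10th syscall, so
only rdtsc() stays within the target in this worst-case test.

```c
#include <stdio.h>

/* Relative overhead of `measured` over `baseline`, in percent. */
static inline double overhead_pct(double baseline, double measured)
{
	return (measured - baseline) / baseline * 100.0;
}

/* Print the overhead of each variant, using the timings quoted above. */
static inline void print_overheads(void)
{
	const double base = 0.0534;	/* no randomization, microseconds */

	printf("rdtsc():             %.1f%%\n", overhead_pct(base, 0.0539));
	printf("get_random_bytes():  %.1f%%\n", overhead_pct(base, 0.0597));
	printf("rdrand (every 10th): %.1f%%\n", overhead_pct(base, 0.0719));
}
```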

And I guess we should once again remember that these are *not* the numbers that real
users will see in practice, since I doubt there are real loads issuing millions of *very
lightweight* syscalls in a loop, so these are really "theoretical, worst case ever" numbers.
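
For reference, a worst-case stress test of this kind boils down to a loop like the one
below. This is a user-space sketch for Linux, not the tool that produced the numbers in
this mail; the function name and the choice of CLOCK_MONOTONIC are assumptions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average cost, in microseconds, of one getpid() syscall over n calls. */
static inline double getpid_usec_per_call(long n)
{
	struct timespec t0, t1;

	assert(n > 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < n; i++)
		syscall(SYS_getpid);	/* raw syscall, so every call enters the kernel */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
		    (double)(t1.tv_nsec - t0.tv_nsec);

	return ns / 1e3 / (double)n;
}
```

Calling getpid_usec_per_call() with a few million iterations, with and without the
randomization enabled, gives a comparable worst-case figure.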

If someone could propose a reasonable *practical* workload to measure with,
then we can see what the overhead is on it for both rdtsc() and get_random_bytes().

Best Regards,
Elena.