Re: [PATCH 1/1] arm64: syscall: Direct PRNG kstack randomization

From: Jeremy Linton
Date: Fri Mar 08 2024 - 11:50:11 EST


Hi,

On 3/7/24 05:10, Arnd Bergmann wrote:
On Wed, Mar 6, 2024, at 22:54, Jeremy Linton wrote:
On 3/6/24 14:46, Arnd Bergmann wrote:
On Wed, Mar 6, 2024, at 00:33, Kees Cook wrote:
On Tue, Mar 05, 2024 at 04:18:24PM -0600, Jeremy Linton wrote:
The existing arm64 stack randomization uses the kernel rng to acquire
5 bits of address space randomization. This is problematic because it
creates non-determinism in the syscall path when the rng state needs to
be generated or reseeded. This shows up as large tail latencies in some
benchmarks and directly affects the minimum RT latencies as seen by
cyclictest.

Other architectures are using timers/cycle counters for this function,
which is sketchy from a randomization perspective because it should be
possible to estimate this value from knowledge of the syscall return
time, and from reading the current value of the timer/counters.
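
(For reference, the "direct PRNG" approach boils down to something like
the sketch below. This is illustrative rather than the patch itself; the
kstack_rng()/kstack_rng_setup() names mirror the ones that come up later
in this thread, and everything else here is assumed.)

#include <linux/percpu.h>
#include <linux/prandom.h>
#include <linux/random.h>

/* Per-CPU PRNG state so the syscall path never touches the kernel rng. */
static DEFINE_PER_CPU(struct rnd_state, kstack_rng_state);

/* Seed from the strong rng outside the hot path (boot, hotplug, ...). */
static void kstack_rng_setup(unsigned int cpu)
{
        prandom_seed_state(per_cpu_ptr(&kstack_rng_state, cpu),
                           get_random_u64());
}

/* Called on the syscall path; the common code keeps only the low bits. */
static inline u32 kstack_rng(void)
{
        return prandom_u32_state(this_cpu_ptr(&kstack_rng_state));
}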

As I commented on the previous version, I don't want to see
a change that only addresses one architecture like this. If you
are convinced that using a cycle counter is a mistake, then we
should do the same thing on the other architectures as well
that currently use a cycle counter.

I personally tend to agree as long as we aren't creating a similar set
of problems for those architectures as we are seeing on arm. Currently
the kstack rng on/off choice is basically zero overhead for them.

I think we have two separate problems to solve here: how
strong a kernel stack randomization should be on a given system,
and who gets to make the decision.

For the strength, we have at least four options:

- strong rng, most expensive
- your new prng, less strong but somewhat cheaper and/or with more
predictable overhead
- cycle counter, cheap but probably even less strong,
needs architecture code.
- no rng, no overhead and no protection.

On architectures that have a cheap hardware rng instruction, you
can count that one as well.
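
(As a rough illustration, with the exact expressions approximated rather
than copied from the tree, this is where those options plug in today:
each architecture feeds some bits into the common
choose_random_kstack_offset() hook on its syscall path.)

#include <linux/random.h>
#include <linux/randomize_kstack.h>
#ifdef CONFIG_X86_64
#include <asm/msr.h>            /* rdtsc() */
#endif

static __always_inline void kstack_choose_offset_example(void)
{
#if defined(CONFIG_X86_64)
        /* cycle counter option: low TSC bits, cheap but guessable */
        choose_random_kstack_offset(rdtsc());
#elif defined(CONFIG_ARM64)
        /* strong rng option: what arm64 does today, hence the tail latency */
        choose_random_kstack_offset(get_random_u16());
#endif
}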

For picking between the options, we have

- architecture maintainer
- defconfig (CONFIG_RANDOMIZE_KSTACK_OFFSET)
- boot time (randomize_kstack_offset=on)

For both of these lists, I would like to see as few options
as possible, and in particular I think the architecture
maintainer should not make that decision for the users.

If we want to improve this, I would start by changing
the binary CONFIG_RANDOMIZE_KSTACK_OFFSET option into
a three-way choice between cycle counter (where available),
strong rng and off, possibly adding the cycle counter
option to the two architectures that currently don't use
it, while moving the strong rng into common code.

After that, we can debate adding a fourth option, or
replacing one of the existing two that is too slow or
not random enough.
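
(Concretely, a minimal sketch of what "move the choice into common code"
could look like; the three-way Kconfig symbols and the
arch_kstack_counter() helper are hypothetical, not existing options.)

#include <linux/random.h>

static __always_inline u32 kstack_offset_rand(void)
{
#if defined(CONFIG_RANDOMIZE_KSTACK_OFFSET_RNG)
        /* strong rng: most expensive, least predictable cost */
        return get_random_u16();
#elif defined(CONFIG_RANDOMIZE_KSTACK_OFFSET_COUNTER)
        /* cycle counter: cheap, needs a small per-arch helper */
        return arch_kstack_counter();
#else
        /* off: no overhead and no protection */
        return 0;
#endif
}

The syscall path would then call
choose_random_kstack_offset(kstack_offset_rand()) in one place, with the
same behaviour regardless of which option was selected.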

Won't that defeat the purpose of the patch, which was intended
to make the syscall latency more predictable? At least the
simpler approaches of reseeding from the kstack_rng()
function itself would have this problem; deferring it to
another context comes with a separate set of problems.

And that describes why I've not come up with an inline reseeding
solution, which of course isn't a problem on !arm if one just pushes a
few bits of a cycle counter into the rnd_state every few dozen syscalls,
or whatever. Mark R. mentioned offline the idea of just picking a few
bits off CNTVCT as a seed, but it's so slow it basically has to be used
to fuzz a bit or two of rnd_state on some fairly long interval. Long
enough that if someone has a solution for extracting rnd_state it might
not add any additional security. Or that is my take, since I'm not a big
fan of any independent counter/clock-based RNG seeding (AFAIK, entropy
from clocks requires multiple _independent_ sources).
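
(For illustration only, that "fuzz a bit or two on a long interval" idea
might look roughly like the following, reusing the per-CPU state from
the earlier sketch; the interval, the number of bits, and the direct
poke at rnd_state are all placeholders, and real code would have to
respect prandom's constraints on its state words.)

#include <linux/percpu.h>
#include <linux/prandom.h>
#include <asm/arch_timer.h>     /* __arch_counter_get_cntvct() */

#define KSTACK_RESEED_INTERVAL  128     /* syscalls between mixes, arbitrary */

static DEFINE_PER_CPU(unsigned int, kstack_rng_calls);

static inline void kstack_rng_fuzz(void)
{
        if (this_cpu_inc_return(kstack_rng_calls) % KSTACK_RESEED_INTERVAL)
                return;

        /* Fold a couple of low counter bits into one word of the state. */
        this_cpu_ptr(&kstack_rng_state)->s1 ^=
                        (u32)__arch_counter_get_cntvct() & 0x3;
}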

I'm not sure I understand the logic. Do you mean that accessing
CNTVCT itself is slow, or that reseeding based on CNTVCT is slow
because of the overhead of reseeding?

Slow, as in, it's running at a much lower frequency than a cycle counter.

On powerpc/s390/x86, the low bits of the respective cycle counter
are simply used without any kind of bit shuffling, and we already
rely on the cycle counter being fast to access since it's used
for timekeeping anyway.

There is not even any attempt to use the most random bits of
the cycle counter, as both the high 22 to 24 bits get masked
out (to keep the wasted stack space small) and the low 3 or 4
bits get ignored because of stack alignment. If there was
any desire to make it more random, a trivial improvement
would be:

+++ b/include/linux/randomize_kstack.h
@@ -80,7 +80,7 @@ DECLARE_PER_CPU(u32, kstack_offset);
 	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
 				&randomize_kstack_offset)) {		\
 		u32 offset = raw_cpu_read(kstack_offset);		\
-		offset ^= (rand);					\
+		offset = ror32(offset, 5) ^ (rand);			\
 		raw_cpu_write(kstack_offset, offset);			\
 	}								\
 } while (0)

My impression is that this is already bordering on becoming
a "bespoke rng" implementation that Jason was objecting to,
so the current version is intentionally left weak in order
to not even give the appearance of being a security-relevant
feature.

This is a bit out of my wheelhouse, so I defer to anyone with a better
feel or some actual data.

The best plan I have at the moment is just some deferred work to call
kstack_rng_setup on some call- or time-based interval, which AFAIK isn't
ideal for RT workloads that expect ~100% CPU isolation. Plus, that
solution assumes we have some handle on how quickly an attacker can
extract enough of the kstack rng state to make predictions.
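
(A minimal sketch of that kind of deferred reseed, assuming the per-CPU
state from the earlier sketch exists; the 10 second interval is
arbitrary, and the cross-CPU write race noted in the comment is exactly
the sort of problem such a scheme would still have to solve.)

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/prandom.h>
#include <linux/random.h>
#include <linux/workqueue.h>

static void kstack_rng_reseed(struct work_struct *work);
static DECLARE_DELAYED_WORK(kstack_rng_work, kstack_rng_reseed);

/* Something at boot would need to schedule the first instance. */
static void kstack_rng_reseed(struct work_struct *work)
{
        int cpu;

        /* Reseeding remote per-CPU state races with the syscall path;
         * real code would want per-CPU work items or an atomic scheme. */
        for_each_online_cpu(cpu)
                prandom_seed_state(per_cpu_ptr(&kstack_rng_state, cpu),
                                   get_random_u64());

        schedule_delayed_work(&kstack_rng_work, 10 * HZ);
}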

I think you fundamentally can't have both. If you rely on the
reseeding happening within a particular number of syscalls, you
eventually end up blocking on it, regardless of the context
it runs in. Deferring it to another process means blocking
for longer, and deferring it to a low-priority task would mean
that real-time tasks get less randomness.

Arnd