Re: [PATCH 2/2] x86/random: Issue a warning if RDRAND or RDSEED fails

From: H. Peter Anvin
Date: Tue Jan 30 2024 - 13:41:23 EST


On January 30, 2024 10:05:59 AM PST, "Daniel P. Berrangé" <berrange@xxxxxxxxxx> wrote:
>On Tue, Jan 30, 2024 at 06:49:15PM +0100, Jason A. Donenfeld wrote:
>> On Tue, Jan 30, 2024 at 6:32 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>> >
>> > On 1/30/24 05:45, Reshetova, Elena wrote:
>> > >> You're the Intel employee so you can find out about this with much
>> > >> more assurance than me, but I understand the sentence above to be _way
>> > >> more_ true for RDRAND than for RDSEED. If your informed opinion is,
>> > >> "RDRAND failing can only be due to totally broken hardware"
>> > > No, this is not the case per Intel SDM. I think we can live under a simple
>> > > assumption that both of these instructions can fail not just due to broken
>> > > HW, but also due to enough pressure put into the whole DRBG construction
>> > > that supplies random numbers via RDRAND/RDSEED.
>> >
>> > I don't think the SDM is the right thing to look at for guidance here.
>> >
>> > Despite the SDM allowing it, we (software) need RDRAND/RDSEED failures
>> > to be exceedingly rare by design. If they're not, we're going to get
>> > our trusty torches and pitchforks and go after the folks who built the
>> > broken hardware.
>> >
>> > Repeat after me:
>> >
>> > Regular RDRAND/RDSEED failures only occur on broken hardware
>> >
>> > If it's nice hardware that's gone bad, then we WARN() and try to make
>> > the best of it. If it turns out that WARN() was because of a broken
>> > hardware _design_ then we go sharpen the pitchforks.
>> >
>> > Anybody disagree?
>>
>> Yes, I disagree. I made a trivial test that shows RDSEED breaks easily
>> in a busy loop. So at the very least, your statement holds true only
>> for RDRAND.
>>
>> But, anyway, if the statement "RDRAND failures only occur on broken
>> hardware" is true, then a WARN() in the failure path there presents no
>> DoS potential of any kind, and so that's a straightforward conclusion
>> to this discussion. However, that really hinges on "RDRAND failures
>> only occur on broken hardware" being a true statement.
>
>There's a useful comment here from an Intel engineer
>
>https://web.archive.org/web/20190219074642/https://software.intel.com/en-us/blogs/2012/11/17/the-difference-between-rdrand-and-rdseed
>
> "RDRAND is, indeed, faster than RDSEED because it comes
> from a hardware-based pseudorandom number generator.
> One seed value (effectively, the output from one RDSEED
> command) can provide up to 511 128-bit random values
> before forcing a reseed"
>
>We know we can exhaust RDSEED directly pretty trivially. Making your
>test program run in parallel across 20 cpus, I got a mere 3% success
>rate from RDSEED.
>
>If RDRAND is reseeding every 511 values, RDRAND output would have
>to be consumed significantly faster than RDSEED in order that the
>reseed will happen frequently enough to exhaust the seeds.
>
>This looks pretty hard, but maybe with a large enough CPU count
>this will be possible in extreme load?
>
>So I'm not convinced we can blindly wave away RDRAND failures as
>guaranteed to mean broken hardware.
>
>With regards,
>Daniel

The general approach has been "don't credit entropy and try again on the next interrupt." We can, of course, be much more aggressive during boot.
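
To make that concrete, here is a rough userspace-flavored sketch of that policy -- it is not the actual drivers/char/random.c code, and toy_pool, credited_bits and the tick loop are placeholders for illustration: one opportunistic RDSEED attempt per tick, credit only on success, no spinning on failure.

/* Userspace-flavored sketch of the policy described above -- not the
 * actual drivers/char/random.c code.  One opportunistic RDSEED attempt
 * per "tick"; on failure nothing is credited and we simply wait for the
 * next tick instead of spinning.  toy_pool is a stand-in for the real pool.
 * Build with: gcc -O2 -mrdseed seed_tick.c
 */
#include <immintrin.h>          /* _rdseed64_step() */
#include <stdio.h>
#include <unistd.h>

static unsigned long long toy_pool;     /* placeholder, not the real pool */
static unsigned int credited_bits;

static int try_credit_seed(void)
{
        unsigned long long seed;

        if (_rdseed64_step(&seed)) {
                toy_pool ^= seed;       /* the real code mixes, not XORs  */
                credited_bits += 64;    /* credit entropy only on success */
                return 1;
        }
        return 0;                       /* no credit; retry next tick     */
}

int main(void)
{
        for (int tick = 0; tick < 1000; tick++) {
                try_credit_seed();
                usleep(1000);           /* stand-in for "next interrupt"  */
        }
        printf("credited %u bits of seed material\n", credited_bits);
        return 0;
}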

We only need 256-512 bits of true entropy in the kernel random pool for attacking it to be equivalent to breaking mainstream crypto primitives, even if it acts purely as a PRNG from that point on (which is extremely unlikely). The Linux PRNG has a very large state, which helps buffer entropy variations.

Again, applications *should* be using /dev/[u]random as appropriate, and if they opt to use lower-level primitives in user space, they need to implement them correctly; there is literally nothing the kernel can do at that point.
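
For completeness, the recommended path needs nothing beyond the kernel interface; a minimal example (the file name and key size here are arbitrary):

/* Minimal example of the recommended approach: take randomness from the
 * kernel via /dev/urandom (getrandom(2) draws from the same pool) rather
 * than issuing RDRAND/RDSEED directly, so no retry loop is needed here.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        unsigned char key[32];                  /* e.g. a 256-bit key */
        FILE *f = fopen("/dev/urandom", "rb");

        if (!f || fread(key, 1, sizeof(key), f) != sizeof(key)) {
                perror("/dev/urandom");
                return EXIT_FAILURE;
        }
        fclose(f);

        for (size_t i = 0; i < sizeof(key); i++)
                printf("%02x", key[i]);
        putchar('\n');
        return 0;
}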

If the probability of success is 3% per attempt on your CPUs, that is still about 2 bits of true entropy per invocation (3% of a 64-bit output). However, the probability that all of 16 successive attempts fail is over 60% (0.97^16 ≈ 0.61). I think this validates the concept of continuing to poll periodically rather than looping in time-critical paths.
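
For anyone who wants to reproduce the numbers, below is a rough reconstruction of that kind of busy-loop test (not the actual program from this thread); run one instance per CPU to recreate the contended case.

/* Rough reconstruction of the sort of busy-loop RDSEED test discussed in
 * this thread (not the actual test program): hammer RDSEED and report the
 * per-attempt success rate.  Build with: gcc -O2 -mrdseed rdseed_rate.c
 */
#include <immintrin.h>          /* _rdseed64_step() */
#include <stdio.h>

int main(void)
{
        const unsigned long attempts = 10UL * 1000 * 1000;
        unsigned long ok = 0;
        unsigned long long v;

        for (unsigned long i = 0; i < attempts; i++)
                ok += _rdseed64_step(&v);       /* 1 on success, 0 on underflow */

        printf("%lu/%lu succeeded (%.2f%%)\n",
               ok, attempts, 100.0 * ok / attempts);
        return 0;
}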