Re: [PATCH 2/2] x86/random: Issue a warning if RDRAND or RDSEED fails

From: Dr. Greg
Date: Wed Jan 31 2024 - 15:40:42 EST


On Wed, Jan 31, 2024 at 02:06:13PM +0100, Jason A. Donenfeld wrote:

Hi again to everyone, beautiful day here in North Dakota.

> On Wed, Jan 31, 2024 at 9:17 AM Reshetova, Elena
> <elena.reshetova@xxxxxxxxx> wrote:
> > This matches both my understanding (I do have a cryptography background
> > and an understanding of how cryptographic RNGs work)
> > and official public docs that Intel published on this matter.
> > Given that the physical entropy source is limited anyhow, by putting
> > enough pressure on the whole construction you should be able to
> > make RDRAND fail, because if the intermediate AES-CBC MAC extractor/
> > conditioner is not getting its min-entropy input rate, it won't
> > produce a proper seed for the AES CTR DRBG.
> > Of course the exact details/numbers can vary between different
> > generations of Intel's DRNG implementation, and the platforms where
> > it is running, so be careful about sticking to concrete numbers.

> Alright, so RDRAND is not reliable. The question for us now is: do
> we want RDRAND unreliability to translate to another form of
> unreliability elsewhere, e.g. DoS/infinite loop/latency/WARN_ON()? Or
> would it be better to declare the hardware simply broken and ask
> Intel to fix it? (I don't know the answer to that question.)

I think it would demonstrate a lack of appropriate engineering
diligence on the part of our community to declare RDRAND 'busted' at
this point.

While it appears to be trivially easy to force RDSEED into depletion,
there is no suggestion, at least in the open literature, that this
directly or easily translates into stalling output from RDRAND in any
relevant adversarial fashion.
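
For anyone who wants to see the depletion half of this for themselves,
here is a minimal user-space sketch, assuming a GCC/Clang toolchain
with the RDSEED intrinsics (build with -mrdseed); a tight loop like
this on one core is typically enough to observe RDSEED reporting
failure:

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    unsigned long long seed, ok = 0, fail = 0;

    /* Hammer RDSEED from a single thread; each _rdseed64_step()
     * call returns 0 when the entropy source cannot keep up. */
    for (unsigned long i = 0; i < 10000000UL; i++) {
        if (_rdseed64_step(&seed))
            ok++;
        else
            fail++;
    }

    printf("ok=%llu fail=%llu (%.1f%% failed)\n",
           ok, fail, 100.0 * fail / (ok + fail));
    return 0;
}

Run a few instances of this on cores sharing a socket and the failure
fraction climbs; the interesting, and so far undemonstrated, step
would be turning that into a corresponding RDRAND stall.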

If this were the case, given what CVEs seem to be worth on a resume,
someone would have rented a cloud machine, come up with a PoC against
RDRAND in a multi-tenant environment, and promptly put up a website
called 'Random Starve' or something equally ominous.

That this has not happened is no doubt a consequence of the 1022x
amplification factor inherent in the 'Bull Mountain' architecture.

I'm a bit surprised that no one from the Intel side of this
conversation pitched this over the wall as soon as the issue came up,
but I would suggest that everyone concerned about it give the
following a thorough read:

https://www.intel.com/content/www/us/en/developer/articles/guide/intel-digital-random-number-generator-drng-software-implementation-guide.html

Relevant highlights:

- As I suggested in my earlier e-mail, random number generation is a
  socket-based resource, hence the adversarial domain is limited to
  the cores sharing a common socket.

- There is a maximum randomness throughput rate of 800 MB/s across all
  cores sharing a common random number infrastructure. Single-thread
  throughput rates of 70-200 MB/s are demonstrable.

- The probability of RDRAND failing over 10 retries is 'astronomically'
  small; no definition of astronomical is provided, but one would
  assume really small, given that they are using the word astronomical.

> > That said, I have taken an AR to follow up internally on what can be done
> > to improve our situation with RDRAND/RDSEED.

I think I can save you some time, Elena.

> Specifying this is an interesting question. What exactly might our
> requirements be for a "non-broken" RDRAND? It seems like we have two
> basic ones:
>
> - One VMX (or host) context can't DoS another one.
> - Ring 3 can't DoS ring 0.
>
> I don't know whether that'd be implemented with context-tied rate
> limiting or more state or what. But I think, short of just making
> RDRAND never fail, that's basically what's needed.

I think we probably have that, for all intents and purposes, provided
that we embrace the following methodology:

- Use RDRAND exclusively.

- Be willing to take 10 swings at the plate.

- Given the somewhat demanding requirements of TDX/COCO, fail and
  either deadlock or panic after 10 swings, since that would seem to
  suggest the hardware is broken, i.e. RMA time (see the sketch below).

Either deadlock or panic would be appropriate. The objective in the
COCO environment is to get the person who clicked on the 'Enable Azure
Confidential' checkbox, or its equivalent, on their cloud dashboard,
to call the HelpDesk and ask them why their confidential application
won't come up.

After the user confirms to the HelpDesk that their computer is plugged
in, the problem will get fixed. Either the broken hardware will be
identified and idled out or the mighty sword of vengeance will be
summoned down on whoever has all of the other cores on the socket
pegged.
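
To make the 'ten swings' concrete: the kernel already takes this
approach in its rdrand_long() helper, which retries RDRAND_RETRY_LOOPS
(10) times before giving up. A minimal user-space sketch of the same
policy, assuming the _rdrand64_step() intrinsic (build with -mrdrand),
with abort() standing in for the deadlock-or-panic decision:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <immintrin.h>

#define RDRAND_RETRY_LOOPS 10  /* same retry count the kernel uses */

/* Take up to 10 swings at the plate; treat persistent failure
 * as broken hardware rather than a condition to paper over. */
static bool rdrand64_retry(unsigned long long *v)
{
    for (int i = 0; i < RDRAND_RETRY_LOOPS; i++) {
        if (_rdrand64_step(v))
            return true;
    }
    return false;
}

int main(void)
{
    unsigned long long r;

    if (!rdrand64_retry(&r)) {
        /* In a COCO guest this is where you would deadlock or
         * panic; abort() is the user-space stand-in. */
        abort();
    }
    printf("rdrand: %016llx\n", r);
    return 0;
}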

Final thoughts:

- RDSEED is probably a poor thing to be using.

- There may be a reasonable argument that RDSEED shouldn't have been
  exposed above ring 0, but that ship has sailed. Brownie points
  moving forward for an RDsomething that is ring 0 only and has
  guaranteed access to some amount of functionally reasonable entropy.

- Intel and AMD are already doing a lot of 'special' stuff with their
  COCO hardware in order to defy the long-standing adage: 'You can't
  have security without physical security'. Access to per-core thermal
  noise, as I suggested, is probably a big lift, but clever engineers
  can probably cook up some type of fairness doctrine for randomness
  in TDX or SEV-SNP, given the particular importance of
  instruction-based randomness in COCO.

- Perfection is the enemy of good.

> Jason

Have a good day.

As always,
Dr. Greg

The Quixote Project - Flailing at the Travails of Cybersecurity
https://github.com/Quixote-Project