Re: [PATCH 2/2] x86/random: Issue a warning if RDRAND or RDSEED fails

From: Dr. Greg
Date: Thu Feb 08 2024 - 06:50:22 EST


On Tue, Feb 06, 2024 at 04:35:29PM +0100, Borislav Petkov wrote:

Good morning, or perhaps afternoon, thanks for taking the time to
reply.

> On Tue, Feb 06, 2024 at 06:04:45AM -0600, Dr. Greg wrote:
> > The silence appears to be deafening out of the respective engineering
> > camps... :-)

> I usually wait for those threads to "relax" themselves first. :)

Indeed, my standard practice is to wait 24 hours before replying to
any public e-mail, hence the delay in my response.

> So, what do you wanna know?

I guess a useful starting point would be whether AMD would like to
offer any quantification of 'astronomically small' when it comes to
the probability of failure over 10 RDRAND attempts... :-)

Secondly, given our test findings and those of Red Hat, would it be
safe to assume that EPYC has engineering that prevents the RDSEED
failures seen on Ryzen?

Given HPA's response in this thread, I do appreciate that all of this
may be shrouded in trade secrets and other issues. With an
acknowledgement of that fact, let me see if I can extend the
discussion in a generic manner that may prove useful to the community
without being 'abusive'.

Both AMD and Intel designs start with a hardware based entropy source.
Intel samples thermal/quantum junction noise; AMD samples execution
jitter over a bank of inverter based oscillators. If the source is
sampled at a constant clock rate, the total randomness bandwidth of
the socket has a hard upper limit.
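
For illustration only, with numbers invented purely for the sake of
arithmetic: a conditioner emitting 128 bits per sample behind a 100
MHz sample clock gives a socket-wide ceiling of 128 * 100e6 = 12.8
Gbit/sec. Split across 64 cores all hammering RDSEED, that is at most
200 Mbit/sec per core. The real figures are, of course, exactly what
we are asking the vendors to quantify.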

None of this implies that randomness is a finite resource; it will
always become available, with the caveat that a core may have to stand
in line, cup in hand, waiting for a dollop.

So this leaves the fundamental question: what does an RDRAND or
RDSEED failure return actually imply?

Silicon is an expensive resource, which would imply a queue depth
limitation for access to the socket-common RNG infrastructure. If the
queue is full when an instruction issues, it would be a logical
response to signal an instruction failure quickly and let software try
again.
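
The software-visible contract is just the carry flag, so the standard
response is a bounded retry loop. A minimal user-space sketch of that
pattern, with names, constants and compile flags of my own choosing
(gcc -mrdrnd), follows:

/*
 * Minimal sketch of the bounded retry pattern; names and the
 * constant are mine, modeled on the widely known 10-retry
 * convention rather than taken from any vendor document.
 */
#include <immintrin.h>
#include <stdbool.h>
#include <stdio.h>

#define RETRY_LOOPS 10          /* the '10 attempts' under discussion */

static bool rdrand64_retry(unsigned long long *out)
{
        for (int i = 0; i < RETRY_LOOPS; i++) {
                /* CF=1 -> success; the intrinsic returns it as 1/0. */
                if (_rdrand64_step(out))
                        return true;
        }
        return false;           /* the 'astronomically small' case */
}

int main(void)
{
        unsigned long long v;

        if (rdrand64_retry(&v))
                printf("%016llx\n", v);
        else
                printf("RDRAND failed %d times\n", RETRY_LOOPS);
        return 0;
}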

An alternate theory would be a requirement for constant instruction
completion time. In that case a 'buffer' of cycles would be included
in the RNG instruction cycle allocation count. If the instruction
would need to 'sleep' waiting for randomness beyond this cycle buffer,
a failure would be returned.
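
If someone wanted to probe the constant completion time theory
empirically, a crude latency comparison would do it. A sketch, with no
claim that the methodology is rigorous (gcc -O2 -mrdseed, rdtsc
serialization subtleties ignored):

/*
 * Rough probe of RDSEED latency, success vs. failure; all names
 * are mine.  Build with e.g. gcc -O2 -mrdseed.
 */
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
        unsigned long long v;
        uint64_t min_ok = UINT64_MAX, min_fail = UINT64_MAX;
        long fails = 0;

        for (long i = 0; i < 10000000; i++) {
                uint64_t t0 = __rdtsc();
                int ok = _rdseed64_step(&v);
                uint64_t t1 = __rdtsc();
                uint64_t dt = t1 - t0;

                if (ok) {
                        if (dt < min_ok)
                                min_ok = dt;
                } else {
                        fails++;
                        if (dt < min_fail)
                                min_fail = dt;
                }
        }

        /* min_fail stays UINT64_MAX if no attempt ever failed. */
        printf("failures: %ld  min ok: %llu  min fail: %llu cycles\n",
               fails, (unsigned long long)min_ok,
               (unsigned long long)min_fail);
        return 0;
}

If failures come back in materially fewer cycles than successes, the
queue depth theory looks better; if the two are comparable, the cycle
buffer theory does.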

Absent broken hardware, 'astronomical' then becomes the probability of
a core being unlucky enough to run into these, or alternate,
implementation scenarios 10 times in a row. Particularly given the
recommendation to sleep between attempts, which makes it likely that
successive attempts will be scheduled onto different cores.
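
For completeness, here is the 'sleep and retry' pattern with a way to
observe whether the scheduler actually moves the task between
attempts; again, everything below is my own invention rather than any
vendor recommendation (glibc/Linux specific, gcc -mrdseed):

/*
 * Retry with a sleep between attempts; sched_getcpu() lets one
 * observe whether the scheduler migrates the task.  All names are
 * mine.
 */
#define _GNU_SOURCE
#include <immintrin.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool rdseed64_patient(unsigned long long *out, int attempts)
{
        for (int i = 0; i < attempts; i++) {
                fprintf(stderr, "attempt %d on cpu %d\n",
                        i, sched_getcpu());
                if (_rdseed64_step(out))
                        return true;
                usleep(100);    /* yield; may be rescheduled elsewhere */
        }
        return false;
}

int main(void)
{
        unsigned long long v;

        if (rdseed64_patient(&v, 10))
                printf("%016llx\n", v);
        else
                printf("RDSEED never succeeded\n");
        return 0;
}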

Any enlightenment along these lines would seem useful in advancing an
understanding of the issues at hand.

Given the time and resources invested in the engineering behind both
TDX and SEV-SNP, it seems unlikely that the very capable engineers at
Intel and AMD failed to anticipate this issue and its proper
resolution for CoCo environments.

> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

All the best from the Upper Midwest.

As always,
Dr. Greg

The Quixote Project - Flailing at the Travails of Cybersecurity
https://github.com/Quixote-Project