Re: [PATCH 2/2] x86/random: Issue a warning if RDRAND or RDSEED fails

From: Borislav Petkov
Date: Fri Feb 09 2024 - 12:31:36 EST


On Thu, Feb 08, 2024 at 05:44:44AM -0600, Dr. Greg wrote:
> I guess a useful starting point would be if AMD would like to offer
> any type of quantification for 'astronomically small' when it comes to
> the probability of failure over 10 RDRAND attempts... :-)

Right, let's establish the common ground first: please have a look at
this whitepaper, albeit a bit outdated:

https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/white-papers/amd-random-number-generator.pdf

in case you haven't seen it yet.

Now, considering that the hardware RNG is a finite resource, you can
imagine scenarios where that source gets depleted.

And newer Zen generations perform significantly better. So much so that
on Zen3 and later, 10 retries should never observe a failure unless the
hardware is bad. Also, I agree with hpa's note that any and all retries
should be time-based.
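
To illustrate what "time-based" could mean as opposed to counting
attempts, here is a minimal userspace sketch. It is an assumption-laden
illustration, not kernel code: try_fn, retry_until_deadline() and the
mock source are hypothetical names, and the mock merely stands in for
RDRAND so that the behaviour is deterministic.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback: returns true and fills *out on success. */
typedef bool (*try_fn)(uint64_t *out, void *ctx);

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Retry until fn succeeds or budget_ns has elapsed. The source is
 * always tried at least once, even with a zero budget.
 */
static bool retry_until_deadline(try_fn fn, void *ctx,
				 uint64_t budget_ns, uint64_t *out)
{
	uint64_t deadline;

	assert(fn != NULL && out != NULL);
	deadline = now_ns() + budget_ns;
	do {
		if (fn(out, ctx))
			return true;
	} while (now_ns() < deadline);
	return false;
}

/* Mock source (assumption): fails fails_left times, then succeeds. */
struct mock {
	unsigned int fails_left;
};

static bool mock_try(uint64_t *out, void *ctx)
{
	struct mock *m = ctx;

	if (m->fails_left) {
		m->fails_left--;
		return false;
	}
	*out = 0x1234;
	return true;
}
```

The point of the deadline is that the retry budget no longer depends on
how fast a given core can spin through the loop.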

> Secondly, given our test findings and those of RedHat, would it be
> safe to assume that EPYC has engineering that prevents RDSEED failures
> that Ryzen does not?

Well, roughly speaking, client is a less beefy and less performant
version of server. You can extrapolate that to the topic at hand.

But at least on AMD, any potential DoSing of RDRAND on client doesn't
matter for CoCo because client doesn't enable SEV*.

> Both AMD and Intel designs start with a hardware based entropy source.
> Intel samples thermal/quantum junction noise, AMD samples execution
> jitter over a bank of inverter based oscillators.

See above paper for the AMD side.

> An assumption of constant clocked sampling implies a maximum
> randomness bandwidth limit.

You said it.

> None of this implies that randomness is a finite resource

Huh? This contradicts what you just said in the sentence above.

Or maybe I'm reading this wrong...

> So this leaves the fundamental question of what does an RDRAND or
> RDSEED failure return actually imply?

Simple: if no random data is ready at the time the insn executes, it
says "invalid". Because the generator is a finite resource, as you said
above, the main case for CF=0 is software trying to pull random data
faster than the hardware can generate it.
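
As a toy model of that rate argument (and only that: this is not a
description of how the real hardware is built), picture a small buffer
that is refilled at a fixed rate and drained by reads. All names here
(rng_model, rng_tick(), rng_read(), count_failures()) are hypothetical.

```c
#include <stdbool.h>

/* Toy model (assumption): a buffer refilled by one value per tick. */
struct rng_model {
	unsigned int capacity;	/* max values that can be buffered */
	unsigned int level;	/* values currently ready */
};

/* One generator tick: produce one value if there is room. */
static void rng_tick(struct rng_model *m)
{
	if (m->level < m->capacity)
		m->level++;
}

/* Analogue of a read: true ~ CF=1 (valid), false ~ CF=0 (not ready). */
static bool rng_read(struct rng_model *m)
{
	if (m->level == 0)
		return false;
	m->level--;
	return true;
}

/* Issue reads_per_tick reads per tick, return how many of them failed. */
static unsigned int count_failures(unsigned int capacity,
				   unsigned int ticks,
				   unsigned int reads_per_tick)
{
	struct rng_model m = { .capacity = capacity, .level = capacity };
	unsigned int failures = 0;

	for (unsigned int t = 0; t < ticks; t++) {
		for (unsigned int r = 0; r < reads_per_tick; r++) {
			if (!rng_read(&m))
				failures++;
		}
		rng_tick(&m);
	}
	return failures;
}
```

In this model a consumer that reads no faster than the refill rate never
sees a failure, while one that reads twice as fast drains the buffer and
then fails on every second read.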

> Silicon is an expensive resource, which would imply a queue depth
> limitation for access to the socket common RNG infrastructure. If the
> queue is full when an instruction issues, it would be a logical
> response to signal an instruction failure quickly and let software try
> again.

That's actually what the APM says in its RDRAND documentation:

"If the returned value is invalid, software must execute the instruction
again."
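
For reference, "execute the instruction again" in its simplest bounded
form looks roughly like the userspace sketch below. Assumptions:
GCC or Clang on x86-64 (for <cpuid.h> and the inline asm), and
hypothetical helper names cpu_has_rdrand(), rdrand64_once() and
rdrand64_retry(). On other architectures the helpers just report
failure. This is an illustration, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <cpuid.h>	/* GCC/Clang CPUID helper */
#endif

/* CPUID leaf 1, ECX bit 30 advertises RDRAND. */
static bool cpu_has_rdrand(void)
{
#if defined(__x86_64__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;
	return (ecx >> 30) & 1;
#else
	return false;
#endif
}

/* One RDRAND attempt: returns the carry flag, i.e. whether *out is valid. */
static bool rdrand64_once(uint64_t *out)
{
#if defined(__x86_64__)
	unsigned char ok;

	__asm__ volatile("rdrand %0; setc %1"
			 : "=r" (*out), "=qm" (ok)
			 :
			 : "cc");
	return ok;
#else
	(void)out;
	return false;
#endif
}

/* Hypothetical helper: try at most max_tries times, stop on first CF=1. */
static bool rdrand64_retry(uint64_t *out, unsigned int max_tries)
{
	assert(out != NULL);
	if (!cpu_has_rdrand())
		return false;
	while (max_tries--) {
		if (rdrand64_once(out))
			return true;
	}
	return false;
}
```

A caller would do something like rdrand64_retry(&v, 10) and must treat a
false return as "no random data", never as a usable value.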

> Given the time and engineering invested in the engineering behind both
> TDX and SEV-SNP, it would seem unlikely that really smart engineers at
> both Intel and AMD didn't anticipate this issue and its proper
> resolution for CoCo environments.

You can probably imagine that no one can build a fully secure system in
a single attempt; it takes an iterative process.

And I don't know how much you've followed those technologies but they
*are* the perfect example of such an iterative improvement process.

I hope this answers at least some of your questions.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette