Re: [Patch v3 Part2 4/9] x86/microcode: Do not call apply_microcode() on sibling threads

From: Ashok Raj
Date: Thu Feb 02 2023 - 11:36:10 EST


Thanks Boris,

On Thu, Feb 02, 2023 at 10:40:31AM +0100, Borislav Petkov wrote:
> On Wed, Feb 01, 2023 at 06:51:44PM -0800, Ashok Raj wrote:
> > T1 shouldn't be sent down the apply_microcode() path, but instead
> > just gather and update the per-cpu info reflecting the current revision.
>
> As I said previously, it doesn't apply the microcode but simply updates
> the cached info. The assumption was that T0 would not fail.

That is super clear. But the conclusion was, in the case T0 fails there are
some side effect that are hard to manage with the current algorighm.

That was the motivation for the change. Sending T1 to just collect and
report the revision, but not via apply_microcode() but via some alternate
function that doesn't have any bad interaction in the possible case if T0
failed.

>
> > At wait_for_siblings: Each thread can check their rev against the BSP (yes
> > BSP can be removed, but we can elect a core leader) and if they are
>
> A core leader?
>
> The one who succeeded in doing the update?

My bad, core leader is the wrong term, I meant something like how MCE
selects a rendezvous leader today. IRC, what we call as the monarch in
MCE land when it handles broadcast MCE's.

>
> > different we can either warn/taint or pull the plug and panic.
>
> What about SMT=off? How are you gonna handle that? Who's the core leader
> there?

This is broken today, IRC, we started to check cpu_present_mask and
cpu_online_mask counts are equal. This change to allow was a late change.

07d981ad4cf1 ("x86/microcode: Allow late microcode loading with SMT disabled")

Since those threads are in mwait(), I'm thinking for correctness we need to
revert that change.

Reverting the patch and borking late-load if ht=off was used, seems the
mitigation to prevent that.

Most BIOS's should have a similar ht_off, and that is the best way to turn
off HT. Its probably better that way to get all execution core resources
reserved for that one active thread.

>
> What about the other vendor? That hackery of yours needs to be verified
> there too.

You mean updating the revision only outside of the path to
apply_microcode()?

Its only AMD/Intel today for microcode updates. Yes we should test and
validate if this assumption is correct. But that's the *assumption* today
correct?

>
> So this whole deal needs to be analyzed properly. What would happen in
> any potential outcome, how should the kernel behave in each case, etc,
> etc.

Complete aggreement. The one case we overlooked in the current algorithm
was "what if any core is in disaggreement with each other"

If there are other cases that come to mind, I can invest in looking into
that.

>
> In thinking about the minrev, I can just as well imagine that such
> microcode patches which could cause such a failure should be marked and
> not loaded late either. But I'm sure you don't want that.

Once there is minrev enforcement, default action, this is automatic,
since we refuse to load anything with minrev=0.

They automatically become eligible for early load only.

>
> In any case, without a proper analysis and writeup how that new
> algorithm is going to look like and behave, this is not going anywhere.
> No ad-hoc patches, no let's fix that aspect first.

I'm happy to add more into Documentation/x86/microcode.rst. I can work with
Dave Hansen on closing or enumerating the algorithm and why we made the
choices.

Cheers,
Ashok