Re: [patch 28/30] x86/microcode: Handle "offline" CPUs correctly

From: Thomas Gleixner
Date: Fri Aug 11 2023 - 05:37:48 EST


On Thu, Aug 10 2023 at 22:46, Peter Zijlstra wrote:

> On Thu, Aug 10, 2023 at 08:38:07PM +0200, Thomas Gleixner wrote:
>
>> for_each_cpu_and(cpu, cpu_present_mask, &cpus_booted_once_mask) {
>> + /*
>> + * Offline CPUs sit in one of the play_dead() functions
>> + * with interrupts disabled, but they still react on NMIs
>> + * and execute arbitrary code. Also MWAIT being updated
>> + * while the offline CPU sits there is not necessarily safe
>> + * on all CPU variants.
>> + *
>> + * Mark them in the offline_cpus mask which will be handled
>> + * by CPU0 later in the update process.
>> + *
>> + * Ensure that the primary thread is online so that it is
>> + * guaranteed that all cores are updated.
>> + */
>> if (!cpu_online(cpu)) {
>> + if (topology_is_primary_thread(cpu) || !allow_smt_offline) {
>> + pr_err("CPU %u not online, loading aborted\n", cpu);
>
> We could make the NMI handler do the ucode load, no? Also, you just need
> any thread online, don't particularly care about primary thread or not
> afaict.

Yes, we could. But I did not go there because it's a fricking nightmare
vs. the offline state and noinstr.

OTOH, it's not really required. Right now we mandate that _all_ present
cores have at least one sibling online. For simplicity (and practical
reasons - think "nosmt") we require the "primary" thread to be online.
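
For illustration only, the rule in the quoted hunk can be modelled in
plain userspace C. This is a sketch, not the kernel code: struct
model_cpu, model_setup_cpus() and the unsigned long bitmask are made-up
stand-ins for the cpumask machinery, and it assumes fewer CPUs than
bits in an unsigned long.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model of a CPU; not a kernel structure. */
struct model_cpu {
	bool present;
	bool booted_once;
	bool online;
	bool primary_thread;
};

/*
 * Returns true if the late load may proceed. Offline secondary threads
 * are recorded in *offline_mask (one bit per CPU index) so that they
 * can be handled separately later. An offline primary thread, or any
 * offline thread when allow_smt_offline is false, aborts the load.
 */
static bool model_setup_cpus(const struct model_cpu *cpus, unsigned int nr,
			     bool allow_smt_offline,
			     unsigned long *offline_mask)
{
	*offline_mask = 0;

	for (unsigned int cpu = 0; cpu < nr; cpu++) {
		/* Only present CPUs which were booted once matter */
		if (!cpus[cpu].present || !cpus[cpu].booted_once)
			continue;
		if (cpus[cpu].online)
			continue;

		if (cpus[cpu].primary_thread || !allow_smt_offline) {
			fprintf(stderr, "CPU %u not online, loading aborted\n", cpu);
			return false;
		}
		*offline_mask |= 1UL << cpu;
	}
	return true;
}
```

With two cores whose secondaries are offline, the model lets the load
proceed only when allow_smt_offline is set, and it aborts as soon as a
primary thread is offline.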

Microcode is strictly per core, no matter how many threads there are. We
would not need any of this mess if Intel had synchronized the threads
on microcode update like AMD does. This is coming with future CPUs
which advertise "uniform" update with a scope ranging from core over
package to system-wide.

Even today, the only thing online SMT siblings do after the primary
thread has updated the microcode is to verify that the update happened,
which creates consistent software state. But in principle the
secondaries could just do nothing and everything would work (+/-
hardware/firmware bugs).
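
A toy model of that split, again in userspace C and with invented
names (struct model_core, struct model_thread and the two helpers are
not kernel interfaces): the revision lives in the core, the primary
writes it, and a sibling merely checks it and refreshes its own cached
copy.

```c
#include <stdbool.h>

/* Hypothetical model: the microcode revision is per-core state */
struct model_core {
	unsigned int hw_rev;
};

/* Each SMT thread keeps a software copy of the revision it has seen */
struct model_thread {
	struct model_core *core;
	unsigned int cached_rev;
};

/* The primary thread performs the actual update for the whole core */
static void model_primary_update(struct model_thread *t, unsigned int new_rev)
{
	t->core->hw_rev = new_rev;
	t->cached_rev = new_rev;
}

/*
 * A sibling does not load anything. It only checks that the core
 * reports the expected revision and syncs its cached copy, which is
 * what keeps the software state consistent.
 */
static bool model_sibling_verify(struct model_thread *t,
				 unsigned int expected_rev)
{
	if (t->core->hw_rev != expected_rev)
		return false;
	t->cached_rev = expected_rev;
	return true;
}
```

If the sibling skipped model_sibling_verify() the core would still run
the new revision. Only the sibling's cached value would be stale.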

Sure, we could lift that requirement, but why make this horrorshow even
more complex than it already is?

There is zero point in supporting esoteric use cases just because we
can. The realistic use case is a server with all threads online or SMT
disabled via command line or sysfs. Anything else is just a pointless
exercise.

Thanks,

tglx