[patch 00/37] cpu/hotplug, x86: Reworked parallel CPU bringup

From: Thomas Gleixner
Date: Fri Apr 14 2023 - 19:44:24 EST


Hi!

This is a complete rework of the parallel bringup patch series (V17)

https://lore.kernel.org/lkml/20230328195758.1049469-1-usama.arif@xxxxxxxxxxxxx

to address the issues which were discovered in review:

1) The X86 microcode loader serialization requirement

https://lore.kernel.org/lkml/87v8iirxun.ffs@tglx

Microcode loading on HT enabled X86 CPUs requires that the microcode is
loaded on the primary thread. The sibling thread(s) must be in
quiescent state; either looping in a place which is aware of potential
changes by the microcode update (see late loading) or in fully quiescent
state, i.e. waiting for INIT/SIPI.

This is required by hardware/firmware on Intel. Aside of that it's a
vendor independent software correctness issue. Assume the following
sequence:

   CPU1.0                    CPU1.1
                             CPUID($A)
   Load microcode.
   Changes CPUID($A, $B)
                             CPUID($B)

CPU1.1 makes a decision on $A and $B which might be inconsistent due
to the microcode update.

The solution for this is to bring up the primary threads first and after
that the siblings. Loading microcode on the siblings is a NOOP on Intel
and on AMD it is guaranteed to only modify thread local state.

This ensures that the APs can load microcode before reaching the alive
synchronization point w/o doing any further x86 specific
synchronization between the core siblings.

2) The general design issues discussed in V16

https://lore.kernel.org/lkml/87pm8y6yme.ffs@tglx

The previous parallel bringup patches just glued this mechanism into
the existing code without a deeper analysis of the synchronization
mechanisms and without generalizing it so that the control logic is
mostly in the core code and not made an architecture specific tinker
space.

Much of that had been pointed out 2 years ago in the discussions about
the early versions of parallel bringup already.


The series is based on:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip x86/apic

and also available from git:

git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git hotplug


Background
----------

The reason why people are interested in parallel bringup is to shorten
the (kexec) reboot time of cloud servers to reduce the downtime of the
VM tenants. There are obviously other interesting use cases for this
like VM startup time, embedded devices...

The current fully serialized bringup does the following per AP:

1) Prepare callbacks (allocate, initialize, create threads)
2) Kick the AP alive (e.g. INIT/SIPI on x86)
3) Wait for the AP to report alive state
4) Let the AP continue through the atomic bringup
5) Let the AP run the threaded bringup to full online state

There are two significant delays:

#3 The time for an AP to report alive state in start_secondary() on x86
has been measured in the range between 350us and 3.5ms depending on
vendor and CPU type, BIOS microcode size etc.

#4 The atomic bringup does the microcode update. This has been measured
to take up to ~8ms on the primary threads depending on the microcode
patch size to apply.

On a two socket SKL server with 56 cores (112 threads) the boot CPU spends
on current mainline about 800ms busy waiting for the APs to come up and
apply microcode. That's more than 80% of the actual onlining procedure.

By splitting the actual bringup mechanism into two parts this can be
reduced to waiting for the first AP to report alive. If the system is
large enough, the first AP is already waiting by the time the boot CPU has
finished the wake-up of the last AP.
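
A toy model of that effect, not taken from the series: the function and
parameter names (wake_us, alive_us, online_us) are made up for illustration
and every AP is assumed to have the same latencies. It only tracks the
wall clock time of the boot CPU.

```python
# Toy model of the time the boot CPU spends on AP bringup.
# All names and latencies are illustrative assumptions, not measurements.

def serialized_us(n_aps, wake_us, alive_us, online_us):
    """Kick one AP, wait until it reports alive, finish onlining, repeat."""
    return n_aps * (wake_us + alive_us + online_us)


def split_us(n_aps, wake_us, alive_us, online_us):
    """Kick all APs first, then complete the onlining one by one."""
    now = 0
    alive_at = []
    for _ in range(n_aps):
        now += wake_us                    # boot CPU sends the wake-up
        alive_at.append(now + alive_us)   # AP reports alive alive_us later
    for t in alive_at:
        now = max(now, t)                 # wait only if the AP is not alive yet
        now += online_us                  # release the AP and complete onlining
    return now
```

With a large enough n_aps the kick loop alone takes longer than alive_us,
so max() never has to wait and the alive latency disappears from the total.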


The actual solution comes in several parts
------------------------------------------

1) [P 1-2] General cleanups (init annotations, kernel doc...)

2) [P 3] The obvious

Avoid pointless delay calibration when TSC is synchronized across
sockets. That removes a whopping 100ms delay for the first CPU of a
socket. This is an improvement independent of parallel bringup and had
been discussed two years ago already.

2) [P 4-6] Removal of the CPU0 hotplug hack.

This was added 11 years ago with the promise to make this a real
hardware mechanism, but that never materialized. As physical CPU
hotplug is not really supported and the physical unplugging of CPU0
never materialized there is no reason to keep this cruft around. It's
just maintenance ballast for no value and the removal makes
implementing the parallel bringup feature way simpler.

3) [P 7-16] Cleanup of the existing bringup mechanism:

a) Code reorganisation so that the general hotplug specific code is
in smpboot.c and not sprinkled all over the place

b) Decouple MTRR/PAT initialization from smp_callout_mask to prepare
for replacing that mask with a hotplug core code synchronization
mechanism.

c) Make TSC synchronization function call based so that the control CPU
does not have to busy wait for nothing if synchronization is not
required.

d) Remove the smp_callin_mask synchronization point as it's no longer
required due to #3c.

e) Rework the sparse_irq_lock held region in the core code so that the
next polling synchronization point in the x86 code can be removed too.

f) Due to #3e it's no longer required to spin wait for the AP to set
its online bit. Remove wait_cpu_online() and the XENPV
counterpart. The control CPU can then wait directly for the online
idle completion by the AP, which frees it up for other work.

This reduces the synchronization points in the x86 code to one, which
is the AP alive one. This synchronization will be moved to core
infrastructure in the next section.

4) [P 17-27] Replace the disconnected CPU state tracking

The extra CPU state tracking which is used by a few architectures is
completely separate from the CPU hotplug core code.

Replacing it by a variant integrated in the core hotplug machinery
allows to reduce architecture specific code and provides a generic
synchronization mechanism for (parallel) CPU bringup/teardown.

- Convert x86 over and replace the AP alive synchronization on x86 with
the core variant which removes the remaining x86 hotplug
synchronization masks.

- Convert the other architectures usage and remove the old interface
and code.

5) [P 28-30] Split the bringup into two steps

First step invokes the wakeup function on the BP, e.g. SIPI/STARTUP on
x86. The second one waits on the BP for the AP to report alive and
releases it for the complete onlining.

As the hotplug state machine allows partial bringup this allows later
to kick all APs alive in a first iteration and then bring them up
completely one by one afterwards.

6) [P 31] Switch the primary thread detection to a cpumask

This makes the parallel bringup a simple cpumask based mechanism
without tons of conditionals and checks for primary threads.

7) [P 32] Implement the parallel bringup core code

The parallel bringup looks like this:

1) Bring up the primary SMT threads to the CPUHP_KICK_AP_ALIVE step
one by one

2) Bring up the primary SMT threads to the CPUHP_ONLINE step one by
one

3) Bring up the secondary SMT threads to the CPUHP_KICK_AP_ALIVE
step one by one

4) Bring up the secondary SMT threads to the CPUHP_ONLINE
step one by one

In case that SMT is not supported this is obviously reduced to step #1
and #2.

8) [P 33-37] Prepare X86 for parallel bringup and enable it
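
The ordering described in #7 can be sketched as follows. This is an
illustration only and not the code in kernel/cpu.c: the function name is
hypothetical and the primary thread cpumask from #6 is modelled as a plain
integer bitmask. Only the two state names are taken from the description
above.

```python
# Sketch of the bringup order from #7, assuming a bitmask of primary threads.
# parallel_bringup_plan() is a hypothetical helper, not a kernel function.

def parallel_bringup_plan(aps, primary_mask):
    """Return the (cpu, target_state) steps in the order they are issued."""
    primaries = [cpu for cpu in aps if (primary_mask >> cpu) & 1]
    secondaries = [cpu for cpu in aps if not (primary_mask >> cpu) & 1]

    plan = []
    # Primaries go first so that microcode is loaded on them before any
    # sibling thread is woken up.
    for group in (primaries, secondaries):
        for state in ("CPUHP_KICK_AP_ALIVE", "CPUHP_ONLINE"):
            plan.extend((cpu, state) for cpu in group)
    return plan
```

Without SMT every CPU is in the primary mask, the secondary group is empty
and the plan consists of the first two steps only.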


Caveats
-------

The non X86 changes have been all compile tested. Boot and runtime
testing has only been done on a few real hardware platforms and qemu as
available. That definitely needs some help from the people who have
these systems at their fingertips.


Results and analysis
--------------------

Here are numbers for a dual socket SKL 56 cores / 112 threads machine. All
numbers in milliseconds. The time measured is the time which the cpu_up()
call takes for each CPU and phase. It's not exact as the system is already
scheduling, handling interrupts and soft interrupts, which is obviously
skewing the picture slightly.

Baseline tip tree x86/apic branch.

             total   avg/CPU       min       max
total  :   912.081     8.217     3.720   113.271

The max of more than 100ms is due to the silly delay calibration for the
second socket which takes 100ms and was eliminated first. Also the other
initial cleanups and improvements take some time away.

So the real baseline becomes:

             total   avg/CPU       min       max
total  :   785.960     7.081     3.752    36.098

The max here is on the first CPU of the second socket. 20ms of that is due
to TSC synchronization and an extra 2ms to react on the SIPI.

With parallel bootup enabled this becomes:

             total   avg/CPU       min       max
prepare:    39.108     0.352     0.238     0.883
online :    45.166     0.407     0.170    20.357
total  :    84.274     0.759     0.408    21.240

That's a factor ~9.3 reduction on average.

Looking at the 27 primary threads of socket 0 then this becomes even more
interesting:

             total   avg/CPU       min       max
total  :   325.764    12.065    11.981    14.125

versus:
             total   avg/CPU       min       max
prepare:     8.945     0.331     0.238     0.834
online :     4.830     0.179     0.170     0.212
total  :    13.775     0.510     0.408     1.046

So the reduction factor is ~23.5 here. That's mostly because the 20ms TSC
sync is not skewing the picture.

For all 55 primaries, i.e. with the 20ms TSC sync extra for socket 1, this
becomes:

             total   avg/CPU       min       max
total  :   685.489    12.463    11.975    36.098

versus:

             total   avg/CPU       min       max
prepare:    19.080     0.353     0.238     0.883
online :    30.283     0.561     0.170    20.357
total  :    49.363     0.914     0.408    21.240

The TSC sync reduces the win to a factor of ~13.8

With 'tsc=reliable' on the command line the socket sync is disabled which
brings it back to the socket 0 numbers:

             total   avg/CPU       min       max
prepare:    18.970     0.351     0.231     0.874
online :    10.328     0.191     0.169     0.358
total  :    29.298     0.543     0.400     1.232

Now looking at the secondary threads only:

             total   avg/CPU       min       max
total  :   100.471     1.794     0.375     4.745

versus:
             total   avg/CPU       min       max
prepare:    19.753     0.353     0.257     0.512
online :    14.671     0.262     0.179     3.461
total  :    34.424     0.615     0.436     3.973

Still a factor of ~3.
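
For reference, the reduction factors are simply the quotient of the
serialized and the parallel 'total' columns. The snippet below recomputes
them from the numbers quoted above; the dictionary keys are just labels
picked for this illustration. The quotients come out close to the rounded
factors given in the text.

```python
# (serialized total, parallel total) in milliseconds, copied from the tables.
measurements = {
    "all APs":            (785.960, 84.274),
    "socket 0 primaries": (325.764, 13.775),
    "all primaries":      (685.489, 49.363),
    "secondaries":        (100.471, 34.424),
}


def reduction_factor(serialized_ms, parallel_ms):
    """How many times faster the parallel bringup is for one group."""
    return serialized_ms / parallel_ms


factors = {name: reduction_factor(*pair) for name, pair in measurements.items()}

for name, factor in factors.items():
    print(f"{name:20s} {factor:5.1f}x")
```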

The average on the secondaries for the serialized bringup is significantly
lower than for the primaries because the SIPI response time is shorter and
the microcode update takes no time.

This varies wildly with the system, whether microcode in BIOS is already up
to date, how big the microcode patch is and how long the INIT/SIPI response
time is. On an AMD Zen3 machine INIT/SIPI response time is amazingly fast
(350us), but then it lacks TSC_ADJUST and does a two millisecond TSC sync
test for _every_ AP. All of this sucks...


Possible further enhancements
-----------------------------

It's definitely worthwhile to look into reducing the cross socket TSC sync
test time. It's probably safe enough to use 5ms or even 2ms instead of 20ms
on systems with TSC_ADJUST and a few other 'TSC is sane' indicators. Moving
it out of the hotplug path is eventually possible, but that needs some deep
thoughts.

Let's take the TSC sync out of the picture by adding 'tsc=reliable' to the
kernel command line. So the bringup of 111 APs takes:

             total   avg/CPU       min       max
prepare:    38.936     0.351     0.231     0.874
online :    25.231     0.227     0.169     3.465
total  :    64.167     0.578     0.400     4.339

Some of the outliers are not necessarily in the state callbacks as the
system is already scheduling and handles interrupts and soft
interrupts. Haven't analyzed that yet in detail.

In the prepare stage which runs on the control CPU the larger steps are:

smpcfd:prepare       16us avg/CPU
threads:prepare      98us avg/CPU
workqueue:prepare    43us avg/CPU
trace/RB:prepare    135us avg/CPU

The trace ringbuffer initialization allocates 354 pages and 354 control
structures one by one. That probably should allocate a large page and an
array of control structures and work from there. I'm sure that would reduce
this significantly. Steven?

smpcfd does just a percpu allocation. No idea why that takes that long.

Vs. threads and workqueues: David thought about spreading out the
preparation work and doing it really in parallel. That's a nice idea, but
the threads and workqueue prepare steps are self serializing. The workqueue
one has a global mutex and aside of that both steps create kernel threads
which implicitly serialize on kthreadd. alloc_percpu(), which is used by
smpcfd:prepare, is also globally serialized.

The rest of the prepare steps is pretty much in the single digit
microseconds range.
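
A quick sanity check on how much of the prepare stage the four larger steps
explain. The step times and the 0.351ms average are taken from the
'tsc=reliable' numbers above; the variable names are only for illustration.

```python
# Average prepare time per CPU with 'tsc=reliable', in microseconds.
PREPARE_AVG_US = 351

# The larger prepare steps, in microseconds per CPU.
larger_steps_us = {
    "smpcfd:prepare": 16,
    "threads:prepare": 98,
    "workqueue:prepare": 43,
    "trace/RB:prepare": 135,
}

accounted_us = sum(larger_steps_us.values())
accounted_share = accounted_us / PREPARE_AVG_US
trace_share = larger_steps_us["trace/RB:prepare"] / PREPARE_AVG_US

print(f"larger steps: {accounted_us}us, {accounted_share:.0%} of prepare")
print(f"trace/RB alone: {trace_share:.0%} of prepare")
```

So the trace ringbuffer setup alone is more than a third of the average
prepare time, which is why it is the first candidate to look at.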

On the AP side it should be possible to move some of the initialization
steps before the alive synchronization point, but that really needs a lot
of analysis whether the functions are safe to invoke that early and outside
of the cpu_hotplug_lock held region for the case of two stage parallel
bringup; see below.

The largest part is:

identify_secondary_cpu() 99us avg/CPU

Inside of identify_secondary_cpu() the largest offender:

mcheck_init() 73us avg/CPU

This part is definitely worth looking at to see whether it can be at least
partially moved to the early startup code before the alive
synchronization point. There's a lot of deep analysis required and
ideally we just rewrite the whole CPUID evaluation trainwreck
completely.

The rest of the AP side is low single digit microseconds except for:

perf/x86:starting        14us avg/CPU

smpboot/threads:online   13us avg/CPU
workqueue:online         17us avg/CPU
mm/vmstat:online         17us avg/CPU
sched:active             30us avg/CPU

sched:active is special. Onlining the first secondary HT thread on the
second socket creates a 3.2ms outlier which skews the whole picture. That's
caused by enabling the static key sched_smt_present which patches the world
and some more. For all other APs this is really in the 1us range. This
definitely could be postponed during bootup like the scheduler domain
rebuild is done after the bringup. But that's still fully serialized and
single threaded and obviously could be done later in the context of async
parallel init. It's unclear why this is different with the fully serialized
bringup where it takes significantly less time, but that's something which
needs to be investigated.
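
The sched:active average can be cross-checked against the single outlier.
Assuming the 30us average is taken over the 111 APs and contains exactly one
3.2ms outlier (both assumptions, the variable names are illustrative), the
remaining APs must be close to 1us each:

```python
N_APS = 111          # APs brought up on the test machine
AVG_US = 30          # sched:active average per CPU
OUTLIER_US = 3200    # first secondary HT thread on the second socket

# Share of the average which is caused by the outlier alone.
outlier_share_us = OUTLIER_US / N_APS

# Average of the other APs once the outlier is taken out.
typical_us = (AVG_US * N_APS - OUTLIER_US) / (N_APS - 1)

print(f"outlier contributes {outlier_share_us:.1f}us to the average")
print(f"all other APs: {typical_us:.1f}us")
```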


Is truly parallel bringup feasible?
-----------------------------------

In theory yes, realistically no. Why?

1) The preparation phase

Allocating memory, creating threads for the to be brought up CPU must
obviously happen on an already online CPU.

While it would be possible to bring up a subset of CPUs first and let
them do the preparation steps for groups of still offline CPUs
concurrently, the actual benefit of doing so is dubious.

The prime example is kernel thread creation, which is implicitly
serialized on kthreadd.

A simple experiment shows that 4 concurrent workers on 4 different
CPUs where each is creating 14 * 5 = 70 kernel threads are 5% slower
than a single worker creating 4 * 14 * 5 = 280 threads.

So we'd need to have multiple kthreadd instances to handle that,
which would then serialize on tasklist lock and other things.

That aside the preparation phase is also affected by the problem
below.

2) Assumptions about hotplug serialization

a) There are quite some assumptions about CPU bringup being fully
serialized across state transitions. A lot of state callbacks rely
on that and would require local locking.

Adding that local locking is surely possible, but that has several
downsides:

- It adds complexity and makes it harder for developers to get
this correct. The subtle bugs resulting out of that are going
to be interesting

- Fine grained locking has a charm, but only if the time spent
for the actual work is larger than the time required for
serialization and synchronization.

Serializing a callback which takes less than a microsecond and
then having a large number of CPUs contending on the lock will
not make it any faster at all. That's a well known issue of
parallelizing and neither made up nor kernel specific.

b) Some operations definitely require to be protected by the
cpu_hotplug_lock, especially those which affect cpumasks as the
masks are guaranteed to be stable in a cpus_read_lock()'ed region.

As this lock cannot be taken in atomic contexts, it's required
that the control CPU holds the lock write locked across these
state transitions. And no, we are not making this a spinlock just
for that and we even can't.

Just slapping a lock into the x86 specific part of the cpumask
update function does not solve anything. The relevant patch in V17
is completely useless as it only serializes the actual cpumask/map
modifications, but all read side users are hosed if the update
would be moved before the alive synchronization point, i.e. into a
non hotplug lock protected region.

Even if the hotplug lock would be held across the whole parallel
bringup operation then this would still expose all usage of these
masks and maps in the actual hotplug state callbacks to concurrent
modifications.

And no, we are not going to expose an architecture specific raw
spinlock to the hotplug state callbacks, especially not to those
in generic code.

c) Some cpus_read_lock()'ed regions also expect that there is no CPU
state transition happening which would modify their local
state. This would again require local serialization.

3) The amount of work and churn:

- Analyze the per architecture low level startup functions plus their
descendant functions and make them ready for concurrency if
necessary.

- Analyze ~300 hotplug state callbacks and their descendant functions
and make them ready for concurrency if necessary.

- Analyze all cpus_read_lock()'ed regions and address their
requirements.

- Rewrite the core code to handle the cpu_hotplug_lock requirements
only in distinct phases of the state machine.

- Rewrite the core code to handle state callback failure and the
related rollback in the context of the new rules.

- ...

Even if some people are dedicated enough to do that, it's very
questionable whether the resulting complexity is justified.

We've spent a serious amount of time to sanitize hotplug and bring it
into a state where it is correct. This also made it reasonably simple
for developers to implement hotplug state callbacks without having to
become hotplug experts.

Breaking this completely up will result in a flood of hard to diagnose
subtle issues for sure. Who is going to deal with them?

The experience with this series so far does not make me comfortable
about that thought in any way.
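
The fine grained locking argument from #2a can be put into numbers with a
simple model. This is a sketch under stated assumptions, not a measurement:
the names are hypothetical, free_ns is the part of a callback which needs no
lock, critical_ns the part under the lock and lock_ns the cost of handing
the contended lock from one CPU to the next.

```python
# Toy model: n_cpus run the same state callback, either one after the other
# (today's serialized bringup) or concurrently with a lock around the
# critical part. All parameters are illustrative assumptions.

def serialized_ns(n_cpus, free_ns, critical_ns):
    """One CPU after the other, no locking required."""
    return n_cpus * (free_ns + critical_ns)


def locked_parallel_ns(n_cpus, free_ns, critical_ns, lock_ns):
    """Lock free part runs concurrently, critical parts queue on the lock."""
    return free_ns + n_cpus * (critical_ns + lock_ns)


def speedup(n_cpus, free_ns, critical_ns, lock_ns):
    return (serialized_ns(n_cpus, free_ns, critical_ns)
            / locked_parallel_ns(n_cpus, free_ns, critical_ns, lock_ns))
```

For a callback which is all critical section, e.g. speedup(100, 0, 500, 200),
the result is below 1, i.e. the locked parallel variant is slower than the
serialized one. Only callbacks with a large lock free part gain anything.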


Summary
-------

The obvious and low hanging fruits have to be solved first:

- The CPUID evaluation and related setup mechanisms

- The trace/ringbuffer oddity

- The sched:active oddity for the first sibling on the second socket

- Some other expensive things which I'm not seeing in my test setup due
to lack of hardware or configuration.

Anything else is pretty much wishful thinking in my opinion.

To be clear. I'm not standing in the way if there is a proper solution,
but that requires to respect the basic engineering rules:

1) Correctness first
2) Keep it maintainable
3) Keep it simple

So far this stuff failed already at #1.

I completely understand why this is important for cloud people, but
the real question to ask here is what are the actual requirements.

As far as I understand the main goal is to make a (kexec) reboot
almost invisible to VM tenants.

Now let's look at how this works:

A) Freeze VMs and persist state
B) kexec into the new kernel
C) Restore VMs from persistent memory
D) Thaw VMs

So the key problem is how long it takes to get from #B to #C and finally
to #D.

As far as I understand #C takes a serious amount of time and cannot be
parallelized for whatever reasons.

At the same time the number of online CPUs required to restore the VMs
state is less than the number of online CPUs required to actually
operate them in #D.

That means it would be good enough to return to userspace with a
limited number of online CPUs as fast as possible. A certain amount of
CPUs are going to be busy with restoring the VMs state, i.e. one CPU
per VM. Some remaining non-busy CPU can bring up the rest of the system
and the APs in order to be functional for #D, i.e. the restore of VM
operation.

Trying to optimize this purely in kernel space by adding complexity of
dubious value is simply bogus in my opinion.

It's already possible today to limit the number of CPUs which are
initially onlined and online the rest later from user space.

There are two issues there:

a) The death by MCE broadcast problem

Quite some (contemporary) x86 CPU generations are affected by
this:

- MCE can be broadcast to all CPUs and not only issued locally
to the CPU which triggered it.

- Any CPU which has CR4.MCE == 0, even if it sits in a wait
for INIT/SIPI state, will cause an immediate shutdown of the
machine if a broadcast MCE is delivered.

b) Do the parallel bringup via sysfs control knob

The per CPU target state interface allows doing that today one
by one, but it's awkward and has quite some overhead.

A knob to online the rest of the not yet onlined present CPUs
with the benefit of the parallel bringup mechanism is
missing.

#a) That's a risk to take by the operator.

Even the regular serialized bringup does not protect against this
issue up to the point where all present CPUs have at least
initialized CR4.

Limiting the number of APs to online early via the kernel command
line widens that window and increases the risk further by
executing user space before all APs have CR4 initialized.

But the same applies to a deferred online mechanism implemented in
the kernel where some worker brings up the not yet online APs while
the early online CPUs are already executing user space code.

#b) Is a no brainer to implement on top of this.
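
For illustration, this is roughly what the one by one approach from user
space looks like today. It is a minimal sketch using the existing sysfs
files 'present', 'online' and 'cpuN/online'; the function names are made up
and the proposed knob for parallel onlining is not part of it. It assumes
the system was booted with a limited number of CPUs, e.g. via 'maxcpus=',
and it needs root to write to sysfs.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0-3,8,10-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def online_remaining(sysfs_cpu=Path("/sys/devices/system/cpu")):
    """Online every present but offline CPU, serialized, one write per CPU."""
    present = parse_cpu_list((sysfs_cpu / "present").read_text())
    online = parse_cpu_list((sysfs_cpu / "online").read_text())
    brought_up = []
    for cpu in sorted(present - online):
        # Each write is a complete, fully serialized cpu_up() in the kernel.
        (sysfs_cpu / f"cpu{cpu}" / "online").write_text("1")
        brought_up.append(cpu)
    return brought_up
```

Calling online_remaining() returns the list of CPU numbers it onlined. Every
write goes through the full serialized bringup, which is exactly the
overhead a dedicated knob on top of the parallel mechanism would avoid.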


Conclusion
----------

Adding the basic parallel bringup mechanism as provided by this series
makes a lot of sense. Improving particular issues as pointed out in the
analysis makes sense too.

But trying to solve an application specific problem fully in the kernel
with tons of complexity, without exploring straight forward and simple
approaches first, does not make any sense at all.

Thanks,

tglx

---
Documentation/admin-guide/kernel-parameters.txt | 20
Documentation/core-api/cpu_hotplug.rst | 13
arch/Kconfig | 23 +
arch/arm/Kconfig | 1
arch/arm/include/asm/smp.h | 2
arch/arm/kernel/smp.c | 18
arch/arm64/Kconfig | 1
arch/arm64/include/asm/smp.h | 2
arch/arm64/kernel/smp.c | 14
arch/csky/Kconfig | 1
arch/csky/include/asm/smp.h | 2
arch/csky/kernel/smp.c | 8
arch/mips/Kconfig | 1
arch/mips/cavium-octeon/smp.c | 1
arch/mips/include/asm/smp-ops.h | 1
arch/mips/kernel/smp-bmips.c | 1
arch/mips/kernel/smp-cps.c | 14
arch/mips/kernel/smp.c | 8
arch/mips/loongson64/smp.c | 1
arch/parisc/Kconfig | 1
arch/parisc/kernel/process.c | 4
arch/parisc/kernel/smp.c | 7
arch/riscv/Kconfig | 1
arch/riscv/include/asm/smp.h | 2
arch/riscv/kernel/cpu-hotplug.c | 14
arch/x86/Kconfig | 45 --
arch/x86/include/asm/apic.h | 5
arch/x86/include/asm/cpu.h | 5
arch/x86/include/asm/cpumask.h | 5
arch/x86/include/asm/processor.h | 1
arch/x86/include/asm/realmode.h | 3
arch/x86/include/asm/sev-common.h | 3
arch/x86/include/asm/smp.h | 26 -
arch/x86/include/asm/topology.h | 23 -
arch/x86/include/asm/tsc.h | 2
arch/x86/kernel/acpi/sleep.c | 9
arch/x86/kernel/apic/apic.c | 22 -
arch/x86/kernel/callthunks.c | 4
arch/x86/kernel/cpu/amd.c | 2
arch/x86/kernel/cpu/cacheinfo.c | 21
arch/x86/kernel/cpu/common.c | 50 --
arch/x86/kernel/cpu/topology.c | 3
arch/x86/kernel/head_32.S | 14
arch/x86/kernel/head_64.S | 121 +++++
arch/x86/kernel/sev.c | 2
arch/x86/kernel/smp.c | 3
arch/x86/kernel/smpboot.c | 508 ++++++++----------------
arch/x86/kernel/topology.c | 98 ----
arch/x86/kernel/tsc.c | 20
arch/x86/kernel/tsc_sync.c | 36 -
arch/x86/power/cpu.c | 37 -
arch/x86/realmode/init.c | 3
arch/x86/realmode/rm/trampoline_64.S | 27 +
arch/x86/xen/enlighten_hvm.c | 11
arch/x86/xen/smp_hvm.c | 16
arch/x86/xen/smp_pv.c | 56 +-
drivers/acpi/processor_idle.c | 4
include/linux/cpu.h | 4
include/linux/cpuhotplug.h | 17
kernel/cpu.c | 397 +++++++++++++++++-
kernel/smp.c | 2
kernel/smpboot.c | 163 -------
62 files changed, 953 insertions(+), 976 deletions(-)