Re: [PATCH v5 0/4] x86: fix hang when AP bringup is too slow

From: Igor Mammedov
Date: Wed Jun 04 2014 - 09:22:14 EST


On Mon, 5 May 2014 22:49:47 +0200
Igor Mammedov <imammedo@xxxxxxxxxx> wrote:

> changes since v4:
> * merge "[PATCH v4 1/5] x86: fix list corruption on CPU hotplug"
> and "[PATCH v4 2/5] x86: fix memory corruption in acpi_unmap_lsapic()"
> together
> * "x86: initialize secondary CPU only if master CPU will wait for it":
> - add 10 seconds timeout description into commit message
> - add smp_mb() after clearing cpu_initialized_mask
>
> changes since v3:
> * put simple bugfixes first
> * move common part of syncing with master CPU in cpu_init()
> for x32/64 variant into helper function
> * cpu_init(): WARN_ON if cpu_initialized_mask is set
> * fix panic on CPU unplug, caused by erroneous removal
> of "pr->dev = dev;" in drivers/acpi/acpi_processor.c
>
> --
> A hang is observed on virtual machines during CPU hotplug,
> especially in big guests with many CPUs. (It happens more
> often if the host is over-committed.)
>
> The hang happens because the master CPU times out while
> waiting for the AP to boot and 'cancels' the CPU online
> operation, assuming the AP is not functional. The AP may
> nevertheless continue to run wild later, causing various
> hangs or panics in the running kernel, which assumes that
> the AP is offline.
>
> This is an alternative approach that, instead of canceling
> the in-progress AP bringup (https://lkml.org/lkml/2014/3/6/257),
> removes the timeouts so that AP bringup is not affected by
> poor timing, and syncs the AP with the master CPU at early
> startup, making sure that the AP won't run wild if the master
> CPU doesn't expect it to come online.
>
> The series also fixes 3 bugs found while testing the CPU
> bringup failure case.
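
For illustration, the early sync described above can be modeled in
userspace roughly as follows. This is only a sketch under assumptions:
the three flags and the function names are hypothetical stand-ins,
pthreads play the role of CPUs, and it is not the code from the patches.
The point it tries to show is that the only timeout sits before the
master commits, so a late AP stays parked instead of running wild.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-AP flags; stand-ins, not the kernel's cpumask API. */
static atomic_bool ap_initialized; /* set by AP: reached early startup   */
static atomic_bool ap_callout;     /* set by master: "go on, I will wait" */
static atomic_bool ap_callin;      /* set by AP: bringup finished         */

static void sleep_ms(long ms)
{
    struct timespec ts = {
        .tv_sec = ms / 1000,
        .tv_nsec = (ms % 1000) * 1000000L,
    };
    nanosleep(&ts, NULL);
}

/* AP side: announce presence, then stay parked until the master commits. */
static void *ap_thread(void *arg)
{
    (void)arg;
    atomic_store(&ap_initialized, true);
    while (!atomic_load(&ap_callout))
        sleep_ms(1);            /* parked: touches no other shared state */
    /* ... the rest of the bringup would run here ... */
    atomic_store(&ap_callin, true);
    return NULL;
}

/*
 * Master side: the only timeout is on the early "initialized" signal.
 * Returns true if the AP came online, false if it never showed up.
 */
static bool master_bring_up(long timeout_ms)
{
    long waited = 0;

    assert(timeout_ms >= 0);
    while (!atomic_load(&ap_initialized)) {
        if (waited >= timeout_ms)
            return false;       /* AP is dead or still parked: safe to give up */
        sleep_ms(1);
        waited++;
    }
    atomic_store(&ap_callout, true);    /* committed: no timeout from here on */
    while (!atomic_load(&ap_callin))
        sleep_ms(1);
    return true;
}
```

A caller would start ap_thread() with pthread_create() and then call
master_bring_up() with the timeout it is willing to tolerate. If the
master gives up, ap_callout is never set, so an AP that arrives late
only spins in its parking loop.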

Since the 3.16 merge window is open now, ping.

> --
> Below is a detailed description of the more frequently occurring hang:
> ---
> The master CPU may time out before cpu_callin_mask is set and cancel
> booting the CPU, but the CPU being onlined still continues to boot, sets
> cpu_active_mask (CPU_STARTING notifiers) and spins in
> check_tsc_sync_target() waiting for the master CPU to arrive. A following
> attempt to online another CPU hangs in stop_machine, initiated from here:
>   smp_callin ->
>     smp_store_cpu_info ->
>       identify_secondary_cpu ->
>         mtrr_ap_init -> set_mtrr_from_inactive_cpu
>
> stop_machine waits for completion of stop_work on all CPUs in
> cpu_active_mask, including the failed CPU that spins in check_tsc_sync_target().
>
> Igor Mammedov (4):
> x86: fix list/memory corruption on CPU hotplug
> acpi_processor: do not mark present at boot but not onlined CPU as
> onlined
> x86: log error on secondary CPU wakeup failure at ERR level
> x86: initialize secondary CPU only if master CPU will wait for it
>
> arch/x86/kernel/cpu/common.c | 27 ++++++----
> arch/x86/kernel/smpboot.c | 104 +++++++++++++----------------------------
> drivers/acpi/acpi_processor.c | 1 -
> 3 files changed, 48 insertions(+), 84 deletions(-)
>
