Re: [PATCH] powerpc/perf: Fix core-imc hotplug callback failure during imc initialization

From: Anju T Sudhakar
Date: Thu Nov 02 2017 - 04:49:04 EST


Hi,


On Wednesday 01 November 2017 06:22 AM, Michael Ellerman wrote:
Anju T Sudhakar <anju@xxxxxxxxxxxxxxxxxx> writes:

Call trace observed during boot:
What's the actual oops?

The actual oops is:

[ 0.750749] PCI: CLS 0 bytes, default 128
[ 0.750855] Unpacking initramfs...
[ 1.570445] Freeing initrd memory: 23168K
[ 1.571090] rtas_flash: no firmware flash support
[ 1.573873] nest_capp0_imc performance monitor hardware support registered
[ 1.574006] nest_capp1_imc performance monitor hardware support registered
[ 1.579616] core_imc memory allocation for cpu 56 failed
[ 1.579730] Unable to handle kernel paging request for data at address 0xffa400010
[ 1.579797] Faulting instruction address: 0xc000000000bf3294
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c000000ff38ff8d0]
pc: c000000000bf3294: mutex_lock+0x34/0x90
lr: c000000000bf3288: mutex_lock+0x28/0x90
sp: c000000ff38ffb50
msr: 9000000002009033
dar: ffa400010
dsisr: 80000
current = 0xc000000ff383de00
paca = 0xc000000007ae0000 softe: 0 irq_happened: 0x01
pid = 13, comm = cpuhp/0
Linux version 4.11.0-39.el7a.ppc64le (mockbuild@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Oct 3 07:42:44 EDT 2017
0:mon> t
[c000000ff38ffb80] c0000000002ddfac perf_pmu_migrate_context+0xac/0x470
[c000000ff38ffc40] c00000000011385c ppc_core_imc_cpu_offline+0x1ac/0x1e0
[c000000ff38ffc90] c000000000125758 cpuhp_invoke_callback+0x198/0x5d0
[c000000ff38ffd00] c00000000012782c cpuhp_thread_fun+0x8c/0x3d0
[c000000ff38ffd60] c0000000001678d0 smpboot_thread_fn+0x290/0x2a0
[c000000ff38ffdc0] c00000000015ee78 kthread+0x168/0x1b0
[c000000ff38ffe30] c00000000000b368 ret_from_kernel_thread+0x5c/0x74


[c000000ff38ffb80] c0000000002ddfac perf_pmu_migrate_context+0xac/0x470
[c000000ff38ffc40] c00000000011385c ppc_core_imc_cpu_offline+0x1ac/0x1e0
[c000000ff38ffc90] c000000000125758 cpuhp_invoke_callback+0x198/0x5d0
[c000000ff38ffd00] c00000000012782c cpuhp_thread_fun+0x8c/0x3d0
[c000000ff38ffd60] c0000000001678d0 smpboot_thread_fn+0x290/0x2a0
[c000000ff38ffdc0] c00000000015ee78 kthread+0x168/0x1b0
[c000000ff38ffe30] c00000000000b368 ret_from_kernel_thread+0x5c/0x74

While registering the cpuhoplug callbacks for core-imc, if we fails
in the cpuhotplug online path for any random core (either because opal call to
initialize the core-imc counters fails or because memory allocation fails for
that core), ppc_core_imc_cpu_offline() will get invoked for other cpus who
successfully returned from cpuhotplug online path.

But in the ppc_core_imc_cpu_offline() path we are trying to migrate the event
context, when core-imc counters are not even initialized. Thus creating the
above stack dump.

Add a check to see if core-imc counters are enabled or not in the cpuhotplug
offline path before migrating the context to handle this failing scenario.
Why do we need a bool to track this? Can't we just check the data
structure we're deinitialising has been initialised?

Doesn't this also mean we won't cleanup the initialisation for any CPUs
that have been initialised?
we do the cleanup in the failing case.




Thanks for the review.

Thanks,
Anju



cheers

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 8812624..08139f9 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -30,6 +30,7 @@ static struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
static cpumask_t nest_imc_cpumask;
struct imc_pmu_ref *nest_imc_refc;
static int nest_pmus;
+static bool core_imc_enabled;
/* Core IMC data structures and variables */
@@ -607,6 +608,19 @@ static int ppc_core_imc_cpu_offline(unsigned int cpu)
if (!cpumask_test_and_clear_cpu(cpu, &core_imc_cpumask))
return 0;
+ /*
+ * See if core imc counters are enabled or not.
+ *
+ * Suppose we reach here from core_imc_cpumask_init(),
+ * since we failed at the cpuhotplug online path for any random
+ * core (either because opal call to initialize the core-imc counters
+ * failed or because memory allocation failed).
+ * We need to check whether core imc counters are enabled or not before
+ * migrating the event context from cpus in the other cores.
+ */
+ if (!core_imc_enabled)
+ return 0;
+
/* Find any online cpu in that core except the current "cpu" */
ncpu = cpumask_any_but(cpu_sibling_mask(cpu), cpu);
@@ -1299,6 +1313,7 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
return ret;
}
+ core_imc_enabled = true;
break;
case IMC_DOMAIN_THREAD:
ret = thread_imc_cpu_init();
--
2.7.4