regmap irqs cause oops when hotpluging a cpu

From: James
Date: Thu Aug 24 2017 - 03:50:43 EST



I've been running linux on Intel cherry-trail devices and discovered that I
couldn't get into S4. Debugging I found that the system would hang if I
attempted to hotplug a CPU (as is done in the final UP to SMP transition in
S4).

The simplest repro was:

echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

looking closer it was exploding in

static void __setup_vector_irq(int cpu)
{
struct apic_chip_data *data;
struct irq_desc *desc;
int irq, vector;

/* Mark the inuse vectors */
for_each_irq_desc(irq, desc) {
struct irq_data *idata = irq_desc_get_irq_data(desc);

data = apic_chip_data(idata);
>>>> if (!data || !cpumask_test_cpu(cpu, data->domain))
continue;

because data->domain wasn't a valid pointer. (setup_vector_irq is called to
setup the vectors of the LAPIC on the new cpu.)

digging, it appears that the (l)apic code assumes that all interrupt domain
hierarchies end up at a lapic (since that's the only way an interrupt gets
delivered on x86)

to that end it does, for every interrupt on the platform:

static struct apic_chip_data *apic_chip_data(struct irq_data *irq_data)
{
if (!irq_data)
return NULL;

while (irq_data->parent_data)
irq_data = irq_data->parent_data;

return irq_data->chip_data;
}

Irq_data->chip_data is private void pointer for the use of the irq chip, but
this code doesn't bother checking that irq_data->chip is &lapic_controller
(it's own chip) because it must be true by plumbing [it assumes walking the
parents from any interrupt must end up at a lapic].

parent_data is meant to get set from domain->parent when the interrupt is
allocated in irq_domain_alloc_irq_data.

but for lots of the interrupts on cherry trail either domain->parent is unset
or the allocation of the irq doesn't follow this path. As a result the code
above explodes. (For example there are lots of regmap irq interrupts, and
regmap_add_irq_chip calls irq_domain_add_linear rather than
irq_domain_add_hierarchy which doesn't set parent)

I'm not sure what the right way(tm) to fix this is - as the place in irqdomain
where the interrupt is mapped likely doesn't know anything about the parent
irq. I expect all calls to add_linear need replacing, and the logic in
irqdomain.c needs to write ->parent_data.

[For a quick hack, perhaps the winning plan is to make setup_vector_irq just
put in a generic cpu bitmap for anything that isn't a lapic interrupt.]

James.