Re: [PATCH 6/6] irqchip/apple-aic: Add support for AICv2

From: Marc Zyngier
Date: Mon Dec 20 2021 - 08:52:33 EST


On Sat, 18 Dec 2021 06:02:24 +0000,
Hector Martin <marcan@xxxxxxxxx> wrote:
>
> On 13/12/2021 03.47, Marc Zyngier wrote:
> >> + * This is followed by a set of event registers, each 16K page aligned.
> >> + * The first one is the AP event register we will use. Unfortunately,
> >> + * the actual implemented die count is not specified anywhere in the
> >> + * capability registers, so we have to explcitly specify the event
> >
> > explicitly
>
> Thanks, fixed!
>
> >
> >> + * register offset in the device tree to remain forward-compatible.
> >
> > Do the current machines actually have more than a single die?
>
> Not the current ones, but there are loud rumors everywhere of multi-die
> products... might as well try to support them ahead of time. The current
> machines *do* have two register sets, implying support for 2-die
> configurations, and although no IRQs are ever asserted from hardware,
> SW_GEN mode works and you can trigger die-ID 1 events.
>
> The interpretation of the capability registers comes from what the macOS
> driver does (that's the only part I looked at it for, since it's kind of
> hard to divine with only a single data point from the hardware). Their
> driver is definitely designed for multi die machines already. The
> register layout I worked out by probing the hardware; it was blatantly
> obvious that there was a second set of IRQ mask arrays after the first,
> that macOS didn't use (yet)...
>
> >> +static struct irq_chip aic2_chip = {
> >> + .name = "AIC2",
> >> + .irq_mask = aic_irq_mask,
> >> + .irq_unmask = aic_irq_unmask,
> >> + .irq_eoi = aic_irq_eoi,
> >> + .irq_set_type = aic_irq_set_type,
> >> +};
> >
> > How is the affinity managed if you don't have a callback? A number of
> > things are bound to break if you don't have one. And a description of
> > how an interrupt gets routed wouldn't go amiss!
>
> It isn't... we don't know all the details yet, but it seems to be Some
> Kind Of Magic™.
>
> There definitely is no way of individually mapping IRQs to specific
> CPUs; there just aren't enough implemented register bits to allow that.
>
> What we do have is a per-IRQ config consisting of:
>
> - Target CPU, 4 bits. This seems to be for pointing IRQs at coprocessors
> (there's actually an init dance to map a few IRQs to specific
> coprocessors; m1n1 takes care of that right now*). Only 0 sends IRQs to
> the AP here, so this is not useful to us.
>
> - IRQ config group, 3 bits. This selects one of 8 IRQ config registers.
> These do indirectly control how the IRQ is delivered; at least they have
> some kind of delay value (coalescing?) and I suspect may do some kind of
> priority control, though the details of that aren't clear yet. I don't
> recall seeing macOS do anything interesting with these groups, I think
> it always uses group 0.
>
> Then each CPU has an IMP-DEF sysreg that allows it to opt-in or opt-out
> of receiving IRQs (!). It actually has two bits, so there may be some
> kind of priority/subset control here too. By default all other CPUs are
> opted out, which isn't great... so m1n1 initializes it to opt in all
> CPUs to IRQ delivery.
>
> The actual delivery flow here seems to be something like AIC/something
> picks a CPU (using some kind of heuristic/CPU state? I noticed WFI seems
> to have an effect here) for initial delivery, and if the IRQ isn't acked
> in a timely manner, it punts and broadcasts the IRQ to all CPUs. The IRQ
> ack register is shared by all CPUs; I don't know if there is some kind
> of per-CPU difference in what it can return, but I haven't observed that
> yet, so I guess whatever CPU gets the IRQ gets to handle anything that
> is pending.
>
> There are also some extra features; e.g. there is definitely a set of
> registers for measuring IRQ latency (tells you how long it took from IRQ
> assertion to the CPU acking it). There's also some kind of global
> control over which CPU *cluster* is tried first for delivery (defaults
> to e-cluster, but you can change it to either p-cluster). We don't use
> those right now.
>
> So there is definitely room for further research here, but the current
> state of affairs is the driver doesn't do affinity at all, and IRQs are
> handled by "some" CPU. In practice, I see a decent (but not completely
> uniform) spread of which CPU handles any given IRQ. I assume it's
> something like it prefers a busy CPU, to avoid waking up a core just to
> handle an IRQ.

The main issue with such magic is that a number of things will break
in a spectacular way for a bunch of drivers. We have a whole class of
(mostly PCI) devices that have per-queue interrupts, each one bound to
a CPU core. The drivers fully expect the interrupt for a given queue
to fire on a given CPU, and *only* this one as they would, for
example, use per-CPU data to get the context of the queue.

With an automatic spread of interrupts, this totally breaks. Probably
because the core will refuse to use managed interrupts due to the lack
of affinity setting callback. And even if you provide a dummy one, it
is the endpoint drivers that will explode.

The only way I can imagine to make this work is to force these
interrupts to be threaded so that the thread can run on a different
CPU than the one the interrupt has been taken on. Performance-wise,
this is likely to be a pig.

I guess we will have to find ways to live with this in the long run,
and maybe teach the core code of these weird behaviours.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.