Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke

From: Gowans, James
Date: Tue Apr 11 2023 - 06:27:36 EST


Hi Marc, thanks for taking the time to put thought into this!

On Sun, 2023-04-09 at 12:41 +0100, Marc Zyngier wrote:
> > Generally it should not be possible for the next interrupt to arrive
> > while the previous handler is still running: the next interrupt should
> > only arrive after the EOI message has been sent and the previous handler
> > has returned.
>
> There isn't necessarily an EOI message. On bare metal, EOI of a LPI
> amounts to sweet nothing (see the pseudocode to convince yourself),
> because the physical LPI doesn't have an active state.

Okay, I saw the gic_eoi_irq() function and thought there might be something
going on there. I guess it's not relevant to this issue.

> >
> > The issue is
> > that the global ITS is responsible for affinity but does not know
> > whether interrupts are pending/running, only the CPU-local redistributor
> > handles the EOI. Hence when the affinity is changed in the ITS, the new
> > CPU's redistributor does not know that the original CPU is still running
> > the handler.
>
> You seem to be misunderstanding the architecture.
>
> The local redistributor doesn't know anything either. A redistributor
> doesn't contain any state about what is currently serviced. At best,
> it knows of the pending state of interrupts, and that's about it. This
> is even more true in a VM. Only the CPU knows about this, and there is
> no EOI that ever gets propagated to the redistributor.

Right. This misunderstanding stems from my confusion above, where I thought the
EOI would move the interrupt back to "pending." That obviously makes no sense:
if another IRQ arrives while the handler is already running, another handler run
must be queued, so it must be something very early in the flow that ACKs the IRQ
and moves it back to "inactive."
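
For the archives, here is roughly how I now understand the entry path. This is a
heavily simplified sketch from my reading of drivers/irqchip/irq-gic-v3.c, not
the actual code (the function name is my own, and the pseudo-NMI, EOImode and
special-INTID handling are all omitted; details differ between kernel versions):

static void gic_handle_irq_sketch(struct pt_regs *regs)
{
        /*
         * The ICC_IAR1_EL1 read *is* the ACK: once it returns, the LPI
         * (which has no active state) is back to "inactive", so a new
         * instance can become pending and be delivered to another CPU
         * while the handler below is still running.
         */
        u32 irqnr = gic_read_iar();

        generic_handle_domain_irq(gic_data.domain, irqnr);

        /* For an LPI the EOI write is effectively a no-op. */
        gic_write_eoir(irqnr);
}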

> > 1. Do we need to mask the IRQ and then unmask it later? I don't think so
> > but it's not entirely clear why handle_edge_irq does this anyway; it's
> > an edge IRQ so not sure why it needs to be masked.
>
> Please measure that cost and weep, specially in the context of
> multiple concurrent interrupts serviced by a single ITS (cost of
> locking the command queue, of waiting for a full round trip to the ITS
> for a couple of commands...).

Fortunately this masking/unmasking would only happen in the rare(ish) case of the
race condition described here actually being hit. Exactly as
with handle_edge_irq(), the mask and later unmask would only be done
when irq_may_run() == false, i.e. when the race is hit. Given how rare an
occurrence that is, I think we could stomach the occasional overhead. I was
really asking whether the masking/unmasking is actually *necessary*; I don't
think it is, hence it wasn't implemented in my patch.
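
For comparison, this is (abridged, from current mainline kernel/irq/chip.c) what
handle_edge_irq() does when it loses this race: it both marks the IRQ pending
and mask+acks it, unmasking again when the owning CPU later replays the run:

void handle_edge_irq(struct irq_desc *desc)
{
        raw_spin_lock(&desc->lock);

        desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);

        if (!irq_may_run(desc)) {
                /*
                 * Lost the race: mark it pending for the owning CPU to
                 * replay, and mask+ack so the line stays quiet until
                 * that CPU unmasks it again.
                 */
                desc->istate |= IRQS_PENDING;
                mask_ack_irq(desc);
                goto out_unlock;
        }
        /* ... replay loop and handler invocation elided ... */
}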

> > index 49e7bc871fec..4e5fc2b7e8a9 100644
> > --- a/kernel/irq/chip.c
> > +++ b/kernel/irq/chip.c
> > @@ -692,8 +692,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
> >
> > raw_spin_lock(&desc->lock);
> >
> > - if (!irq_may_run(desc))
> > + if (!irq_may_run(desc)) {
> > + desc->istate |= IRQS_PENDING;
> > goto out;
> > + }
>
> What is currently unclear to me is how we get there on another CPU if
> we're still handling the interrupt, which happens in the critical
> section delineated by desc->lock. So there is some additional state
> change here which you don't seem to be describing.

The lock is actually released and later re-acquired in handle_irq_event(), which
is called below. Only this flag-wrangling code runs with the irq_desc lock held;
all of the IRQ-specific handler code further down actually runs unlocked. That
is the window in which another CPU can get into irq_may_run() concurrently with
the handler running on a different CPU.
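
Concretely, this is handle_irq_event() from mainline kernel/irq/handle.c (comment
added by me) -- note the unlock around the actual handler invocation:

irqreturn_t handle_irq_event(struct irq_desc *desc)
{
        irqreturn_t ret;

        desc->istate &= ~IRQS_PENDING;
        irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
        raw_spin_unlock(&desc->lock);

        /*
         * desc->lock is NOT held while the handlers run: this is the
         * window in which another CPU can take the same interrupt and
         * fail irq_may_run().
         */
        ret = handle_irq_event_percpu(desc);

        raw_spin_lock(&desc->lock);
        irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
        return ret;
}
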
>
> > desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
> >
> > @@ -711,7 +713,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
> > if (desc->istate & IRQS_ONESHOT)
> > mask_irq(desc);
> >
> > - handle_irq_event(desc);
> > + do {
> > + handle_irq_event(desc);
> > + } while (unlikely((desc->istate & IRQS_PENDING) &&
> > + !irqd_irq_disabled(&desc->irq_data)));
> >
> > cond_unmask_eoi_irq(desc, chip);
> >
>
> Why do we need to inflict any of this on all other users of this flow?
> The lack of active state on LPIs is the issue, as it allows a new
> interrupt to be presented to another CPU as soon as the previous one
> has been ACK'ed.

Do you mean an active state in hardware? Even with an active state, that state
would be CPU-local, right? The issue here is that when the CPU affinity changes
while the handler is running, the new CPU early-outs because the flags in
irq_desc indicate the interrupt is already running elsewhere. A CPU-local
hardware active state would not help with that; it would only help if the
active/pending state were global.
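
To spell out where the new CPU bails: irq_may_run() (mainline kernel/irq/chip.c,
my comments) sees IRQD_IRQ_INPROGRESS -- set by the original CPU in
handle_irq_event() above -- and returns false:

static bool irq_may_run(struct irq_desc *desc)
{
        unsigned int mask = IRQD_IRQ_INPROGRESS | IRQD_WAKEUP_ARMED;

        /* Not in progress and not an armed wakeup source: OK to run. */
        if (!irqd_has_set(&desc->irq_data, mask))
                return true;

        /* Armed wakeup source: mark it pending/suspended and bail. */
        if (irq_pm_check_wakeup(desc))
                return false;

        /* Handle a potential concurrent poll on another CPU. */
        return irq_check_poll(desc);
}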

As for inflicting this on other users, this is pretty light in general. It's
basically just looking at that flag again; in general it will be unset and we'll
exit the loop after one invoke.

>
> This also has the effect of keeping the handling of the interrupt on
> the original CPU, negating the effects of the change of affinity.

Yes, this bothered me too initially, but on reflection I'm not sure it's
actually a problem. One possible issue that came to mind was CPU offlining, but
if a CPU being offlined were still running interrupt handlers, the offline
couldn't complete until those handlers finished anyway, so I don't think that's
an issue. Do you see any practical problem with running the handler once more on
the original CPU immediately after the affinity has changed?

> It feels that we should instead resend the interrupt, either by making
> it pending again by generating an INT command to the ITS, or by using
> the SW resend mechanism (which the lack of deactivation requirement
> makes possible).

I'm open to other suggestions here, just not sure how this re-sending would work
in practice. Do you mean that when the original CPU observes that the pending
flag is set, instead of running the handler again locally the original CPU would
instruct the ITS to mark the interrupt as pending again, hence re-triggering it
on the new CPU? That could work, but may be tricky to shoe-horn into the generic
code, unless we use the EOI callback for this?
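
To make sure I understand the suggestion, something like this? An untested
sketch only, reusing the existing check_irq_resend() helper from
kernel/irq/resend.c (which tries the irqchip's ->irq_retrigger() and falls back
to the SW resend tasklet); I haven't checked whether its level-type and
IRQS_REPLAY checks get in the way for fasteoi users:

        if (!irq_may_run(desc)) {
                /* As in my patch: just mark pending and bail. */
                desc->istate |= IRQS_PENDING;
                goto out;
        }

        /*
         * ...and on the original CPU, once the handler returns, ask the
         * core to re-trigger instead of looping locally. For the ITS
         * this could become an INT command via ->irq_retrigger(), with
         * the SW resend tasklet as the fallback.
         */
        if ((desc->istate & IRQS_PENDING) &&
            !irqd_irq_disabled(&desc->irq_data))
                check_irq_resend(desc, false);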

Fundamentally I don't think the solution here should be GIC-specific. We hit
this on the GIC, but potentially any irq chip could trigger the same
lost-interrupt problem? That's one of the reasons I preferred a tweak to the
generic code that fixes this for everyone.

JG
