Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke

From: Marc Zyngier
Date: Sun Apr 09 2023 - 07:42:06 EST


James,

On Fri, 17 Mar 2023 09:53:00 +0000,
James Gowans <jgowans@xxxxxxxxxx> wrote:
>
> Update the generic handle_fasteoi_irq to cater for the case when the
> next interrupt comes in while the previous handler is still running.
> Currently when that happens the irq_may_run() early out causes the next
> IRQ to be lost. Change the behaviour to mark the interrupt as pending
> and re-run the handler when handle_fasteoi_irq sees that the pending
> flag has been set again. This is largely inspired by handle_edge_irq.
>
> Generally it should not be possible for the next interrupt to arrive
> while the previous handler is still running: the next interrupt should
> only arrive after the EOI message has been sent and the previous handler
> has returned.

There isn't necessarily an EOI message. On bare metal, EOI of an LPI
amounts to sweet nothing (see the pseudocode to convince yourself),
because the physical LPI doesn't have an active state. When
virtualised, the active state is limited to the state stored in the
LRs, and isn't propagated anywhere else.

> However, there is a race: if the interrupt affinity is changed while
> the handler is running, the next interrupt can arrive at a different
> CPU while the original handler is still running. In that case there
> will be a concurrent invoke and the early out will be taken.
>
> For example:
>
> CPU 0                        | CPU 1
> -----------------------------|-----------------------------
> interrupt start              |
> handle_fasteoi_irq           | set_affinity(CPU 1)
>   handler                    |
>   ...                        | interrupt start
>   ...                        | handle_fasteoi_irq -> early out
> handle_fasteoi_irq return    | interrupt end
> interrupt end                |
>
> This issue was observed specifically on an arm64 system with a GIC-v3
> handling MSIs; GIC-v3 uses the handle_fasteoi_irq handler. The issue is
> that the global ITS is responsible for affinity but does not know
> whether interrupts are pending/running; only the CPU-local redistributor
> handles the EOI. Hence when the affinity is changed in the ITS, the new
> CPU's redistributor does not know that the original CPU is still running
> the handler.

You seem to be misunderstanding the architecture.

The local redistributor doesn't know anything either. A redistributor
doesn't contain any state about what is currently serviced. At best,
it knows of the pending state of interrupts, and that's about it. This
is even more true in a VM. Only the CPU knows about this, and there is
no EOI that ever gets propagated to the redistributor.

>
> There are a few uncertainties about this implementation compared to the
> prior art in handle_edge_irq:
>
> 1. Do we need to mask the IRQ and then unmask it later? I don't think
> so, but it's not entirely clear why handle_edge_irq does this anyway;
> it's an edge IRQ, so it's not obvious why it needs to be masked.

Please measure that cost and weep, especially in the context of
multiple concurrent interrupts serviced by a single ITS (cost of
locking the command queue, of waiting for a full round trip to the ITS
for a couple of commands...).
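
A crude way to put a number on that, purely illustrative (this is not
part of any patch; mask_irq()/unmask_irq() are the internal helpers
from kernel/irq/internals.h, and desc is assumed to be in scope):

	/* Illustrative only: time one mask/unmask pair on an ITS-routed
	 * LPI to see the command-queue round trip.
	 */
	u64 t0 = ktime_get_ns();

	mask_irq(desc);
	unmask_irq(desc);

	pr_info("mask/unmask pair: %llu ns\n", ktime_get_ns() - t0);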

>
> 2. Should the EOI delivery be inside the do loop after every handler
> run? I think outside the loop is best; only inform the chip to deliver
> more IRQs once all pending runs have been finished.
>
> 3. Do we need to check that desc->action is still set? I don't know if
> it can be un-set while the IRQ is still enabled.
>
> 4. There is now more overlap with the handle_edge_eoi_irq handler;
> should we try to unify these?
>
> Signed-off-by: James Gowans <jgowans@xxxxxxxxxx>
> Reviewed-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Marc Zyngier <maz@xxxxxxxxxx>
> Cc: KarimAllah Raslan <karahmed@xxxxxxxxxx>
> ---
>  Documentation/core-api/genericirq.rst | 9 ++++++++-
>  kernel/irq/chip.c                     | 9 +++++++--
>  2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst
> index f959c9b53f61..b54485eca8b5 100644
> --- a/Documentation/core-api/genericirq.rst
> +++ b/Documentation/core-api/genericirq.rst
> @@ -240,7 +240,14 @@ which only need an EOI at the end of the handler.
>
> The following control flow is implemented (simplified excerpt)::
>
> -    handle_irq_event(desc->action);
> +    if (desc->status & running) {
> +        desc->status |= pending;
> +        return;
> +    }
> +    do {
> +        desc->status &= ~pending;
> +        handle_irq_event(desc->action);
> +    } while (desc->status & pending);
>      desc->irq_data.chip->irq_eoi();
>
>
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 49e7bc871fec..4e5fc2b7e8a9 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -692,8 +692,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>
>  	raw_spin_lock(&desc->lock);
>
> -	if (!irq_may_run(desc))
> +	if (!irq_may_run(desc)) {
> +		desc->istate |= IRQS_PENDING;
>  		goto out;
> +	}

What is currently unclear to me is how we get there on another CPU if
we're still handling the interrupt, which happens in the critical
section delineated by desc->lock. So there is some additional state
change here which you don't seem to be describing.
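
For reference, handle_irq_event() (kernel/irq/handle.c; excerpt from
memory, simplified) drops desc->lock for the duration of the handler
run, which is the window in which another CPU can take the interrupt:

	irqreturn_t handle_irq_event(struct irq_desc *desc)
	{
		irqreturn_t ret;

		desc->istate &= ~IRQS_PENDING;
		irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
		raw_spin_unlock(&desc->lock);	/* handler runs unlocked */

		ret = handle_irq_event_percpu(desc);

		raw_spin_lock(&desc->lock);	/* retaken after the handler */
		irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);

		return ret;
	}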

>
>  	desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
>
> @@ -711,7 +713,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>  	if (desc->istate & IRQS_ONESHOT)
>  		mask_irq(desc);
>
> -	handle_irq_event(desc);
> +	do {
> +		handle_irq_event(desc);
> +	} while (unlikely((desc->istate & IRQS_PENDING) &&
> +			  !irqd_irq_disabled(&desc->irq_data)));
>
>  	cond_unmask_eoi_irq(desc, chip);
>

Why do we need to inflict any of this on all other users of this flow?
The lack of active state on LPIs is the issue, as it allows a new
interrupt to be presented to another CPU as soon as the previous one
has been ACK'ed.

This also has the effect of keeping the handling of the interrupt on
the original CPU, negating the effects of the change of affinity.

It feels like we should instead resend the interrupt, either by making
it pending again by generating an INT command to the ITS, or by using
the SW resend mechanism (which the lack of deactivation requirement
makes possible).
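
As an untested sketch of the SW resend variant (check_irq_resend() is
the existing entry point into the resend machinery; whether it is safe
to call from this context is exactly what would need auditing):

	/* Sketch: instead of looping in handle_fasteoi_irq(), feed a
	 * pending interrupt back through the resend machinery so it is
	 * taken again, this time on the new target CPU.
	 */
	if (unlikely(desc->istate & IRQS_PENDING))
		check_irq_resend(desc, false);

	cond_unmask_eoi_irq(desc, chip);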

Thanks,

M.

--
Without deviation from the norm, progress is not possible.