Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY

From: Leonardo Bras
Date: Thu Feb 22 2024 - 23:38:09 EST


On Wed, Feb 21, 2024 at 04:41:20PM +0100, Thomas Gleixner wrote:
> On Wed, Feb 21 2024 at 02:39, Leonardo Bras wrote:
> > On Mon, Feb 19, 2024 at 12:03:07PM +0100, Thomas Gleixner wrote:
> >> >> In scenarios where there is no need to keep track of IRQs handled,
> >> >> convert it back to IRQ_HANDLED.
> >> >
> >> > That's not really workable as you'd have to update tons of drivers just
> >> > to deal with that corner case. That's error prone and just extra
> >> > complexity all over the place.
> >
> > I agree, that's a downside of this implementation.
>
> A serious one which is not really workable. See below.
>
> > I agree the above may solve the issue, but it would require two extra
> > atomic ops in the thread handling the IRQ, plus one extra atomic
> > operation in note_interrupt(), which could increase the latency of
> > deferring the handler to a thread.
> >
> > I mean, yes, the CPU running note_interrupt() would probably already
> > have exclusive access to this cacheline, but it further increases
> > cacheline bouncing and adds the memory barriers that come with atomic
> > operations, even if we use an extra bit of threads_handled instead of
> > allocating a new field for threads_running.
>
> I think that's a strawman. Atomic operations can of course be more
> expensive than non-atomic ones, but they only start to make a difference
> when the cache line is contended. That's not the case here for the
> normal operations.
>
> Interrupts and their threads are strictly targeted to a single CPU and
> the cache line is already hot and had to be made exclusive because of
> other write operations to it.
>
> There is usually no concurrency at all, except for administrative
> operations like enable/disable or affinity changes. Those administrative
> operations are not high frequency and the resulting cache line bouncing
> is unavoidable even without that change. But does it matter in the
> larger picture? I don't think so.

That's a fair point, but some use cases combine CPU isolation with
PREEMPT_RT in order to reduce interference on a CPU running an RT
workload.

In those cases, IIRC the threaded handler will run on a different
(housekeeping) CPU when the IRQ originates on an isolated CPU, so the
cacheline bouncing described above is expected (see the sketch below).


>
> > On top of that, consider a scenario where the threaded handler resolves
> > a lot of requests without necessarily spending much time doing so, i.e.
> > the thread runs only briefly per wakeup.
> >
> > In this scenario, note_interrupt() could return without incrementing
> > irqs_unhandled for the IRQs that arrive while the thread is briefly
> > running, but every other IRQ would still cause note_interrupt() to
> > increment irqs_unhandled, so the bug would still reproduce.
>
> In theory yes. Does it happen in practice?
>
> But that exposes a flaw in the actual detection code. The code is
> unconditionally accumulating if there is an unhandled interrupt within
> 100ms after the last unhandled one. IOW, if there is a periodic
> unhandled one every 50ms, the interrupt will be shut down after 100000 *
> 50ms = 5000s ~= 83.3m ~= 1.4h. Nor does it care about the number of
> actually handled interrupts.
>
> The spurious detector is really about runaway interrupts which hog a CPU
> completely, but the above is not what we want to protect against.

Now it makes a lot more sense to me.
Thanks!
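
(For my own notes: the aging logic in question is

	if (time_after(jiffies, desc->last_unhandled + HZ/10))
		desc->irqs_unhandled = 1;
	else
		desc->irqs_unhandled++;
	desc->last_unhandled = jiffies;

so any unhandled interrupt arriving within 100ms of the previous unhandled
one keeps accumulating, no matter how many handled interrupts happened in
between.)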

>
> > I understand my suggested change increases the irqreturn handling
> > complexity, but it should not add much overhead to either IRQ deferring
> > or IRQ thread handling. Also, since it accounts for every IRQ actually
> > handled, it does not matter how long the handler thread runs; it still
> > avoids the bug.
>
> I'm not worried about the extra complexity in note_interrupt(). That's
> fine and I don't have a strong opinion on whether to use your patch 2/4 or
> the simpler one I suggested.
>
> > As you previously noted, the main issue with my suggestion is that it
> > requires changing driver code. But since that change is optional, I
> > wonder how many drivers handle IRQs that:
> > - use edge type interrupts, and
>
> It's not up to the driver to decide that. That's a platform property; the
> driver does not even know, and it should not care, because all that
> matters for the driver is that it gets the interrupts and does not lose
> any.
>
> > - can trigger really fast, many many times, and
>
> That's a hardware or emulation property and I don't know how many
> devices would expose this behaviour.
>
> > - can run in force-thread mode, and
>
> Almost all drivers.
>
> > - have handlers that are able to deal with multiple IRQs per call.
>
> Pretty much every driver which has to deal with queues, FIFOs
> etc. That's a lot of them.
>
> > About solving it directly in generic code, I agree that would be the
> > ideal scenario. That's why I suggested it in RFCv1: if the thread
> > handles a single IRQ, we reset irqs_unhandled to zero. That's a good
> > indicator of whether the thread is stuck, and TBH I don't think this is
> > too far from the current 100/100k ratio if we consider that some of
> > those handlers can handle multiple IRQs at once.
>
> It has the downside that a scenario where there is a spurious interrupt
> flood with an occasional one in between which is handled by the thread
> will not be detected anymore. The accumulation and time period tracking
> are there for a reason.

Ok, fair point.
But if we disable the interrupt, we also lose even those few successfully
handled ones.

But now that I understand the point is to keep a runaway interrupt from
hogging the CPU, it makes sense: we would rather disable the interrupt
than compromise the system. It also emits a warning that may help track
down the driver/device issue.
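
For reference, IIUC the path this ends up taking is the existing shutdown
code in note_interrupt() (kernel/irq/spurious.c), roughly:

	__report_bad_irq(desc, action_ret);
	/* Now kill the IRQ */
	printk(KERN_EMERG "Disabling IRQ #%d\n", irq);
	desc->istate |= IRQS_SPURIOUS_DISABLED;
	desc->depth++;
	irq_disable(desc);

so the line is kept disabled (with handlers subsequently invoked from the
poll timer), and the KERN_EMERG message is the warning that helps track
down the misbehaving driver/device.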

>
> But as I pointed out above, the detection logic is flawed due to the
> unconditional accumulation. Can you give the uncompiled patch below a
> test ride with your scenario?
>
> Thanks,
>
> tglx
> ---
> include/linux/irqdesc.h | 2 ++
> kernel/irq/spurious.c | 33 +++++++++++++++++++++++++++++----
> 2 files changed, 31 insertions(+), 4 deletions(-)
>
> --- a/include/linux/irqdesc.h
> +++ b/include/linux/irqdesc.h
> @@ -29,6 +29,7 @@ struct pt_regs;
> * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers
> * @tot_count: stats field for non-percpu irqs
> * @irq_count: stats field to detect stalled irqs
> + * @first_unhandled: Timestamp of the first unhandled interrupt
> * @last_unhandled: aging timer for unhandled count
> * @irqs_unhandled: stats field for spurious unhandled interrupts
> * @threads_handled: stats field for deferred spurious detection of threaded handlers
> @@ -64,6 +65,7 @@ struct irq_desc {
> unsigned int wake_depth; /* nested wake enables */
> unsigned int tot_count;
> unsigned int irq_count; /* For detecting broken IRQs */
> + unsigned long first_unhandled;
> unsigned long last_unhandled; /* Aging timer for unhandled count */
> unsigned int irqs_unhandled;
> atomic_t threads_handled;
> --- a/kernel/irq/spurious.c
> +++ b/kernel/irq/spurious.c
> @@ -390,9 +390,14 @@ void note_interrupt(struct irq_desc *des
> * working systems
> */
> if (time_after(jiffies, desc->last_unhandled + HZ/10))
> - desc->irqs_unhandled = 1;
> - else
> - desc->irqs_unhandled++;
> + desc->irqs_unhandled = 0;
> +
> + if (!desc->irqs_unhandled) {
> + desc->irq_count = 0;
> + desc->first_unhandled = jiffies;
> + }
> +
> + desc->irqs_unhandled++;
> desc->last_unhandled = jiffies;
> }
>
> @@ -411,9 +416,27 @@ void note_interrupt(struct irq_desc *des
> if (likely(desc->irq_count < 100000))
> return;
>
> - desc->irq_count = 0;
> if (unlikely(desc->irqs_unhandled > 99900)) {
> /*
> + * Validate that this is actually an interrupt storm, which
> + * happens to:
> + *
> + * - increment the unhandled count to ~100k within 5 seconds,
> + * which means there is a spurious interrupt every 50us
> + * - have an unhandled to handled ratio > 2
> + *
> + * Otherwise reset the detector and start over. This also
> + * covers the case where a threaded handler consumes
> + * several hard interrupts in one go which would otherwise
> + * accumulate to 99900 over time and trigger a false positive.
> + */
> + unsigned long td = desc->last_unhandled - desc->first_unhandled;
> + unsigned int handled = desc->irq_count - desc->irqs_unhandled;
> +
> + if (td > 5 * HZ || desc->irqs_unhandled / handled < 3)
> + goto out;
> +
> + /*
> * The interrupt is stuck
> */
> __report_bad_irq(desc, action_ret);
> @@ -428,7 +451,9 @@ void note_interrupt(struct irq_desc *des
> mod_timer(&poll_spurious_irq_timer,
> jiffies + POLL_SPURIOUS_IRQ_INTERVAL);
> }
> +out:
> desc->irqs_unhandled = 0;
> + desc->irq_count = 0;
> }
>
> bool noirqdebug __read_mostly;
>

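One small thing I noticed while reading the diff: if every interrupt in
the window was unhandled, handled ends up as 0 and
"desc->irqs_unhandled / handled" divides by zero. A minimal guard,
assuming we want to treat "nothing handled at all" as a storm, could be:

	if (td > 5 * HZ || (handled && desc->irqs_unhandled / handled < 3))
		goto out;
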
Sure, I will give it a try.

Now that I have a better understanding of what is expected here, I will
also experiment a little on my own.

Thank you!
Leo