Re: [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq

From: Bitao Hu
Date: Thu Jan 25 2024 - 02:51:14 EST




On 2024/1/25 08:19, Doug Anderson wrote:
Hi,

On Tue, Jan 23, 2024 at 4:12 AM Bitao Hu <yaoma@xxxxxxxxxxxxxxxxx> wrote:

When the watchdog determines that the current soft lockup is due
to an interrupt storm based on CPU utilization, reporting the
top three most time-consuming hardirq can be quite useful for
further troubleshooting.
Below is an example of interrupt storm. The call tree does not
provide useful information, but we can analyze which interrupt
caused the soft lockup by using the time-consuming information
of hardirq.

[ 67.714044] watchdog: BUG: soft lockup - CPU#9 stuck for 28s! [swapper/9:0]
[ 67.714548] CPU#9 Utilization every 4s during lockup:
[ 67.714549] #1: 0.0% system, 0.0% softirq, 95.0% hardirq, 0.0% idle
[ 67.714551] #2: 0.0% system, 0.0% softirq, 90.0% hardirq, 0.0% idle
[ 67.714553] #3: 0.0% system, 0.0% softirq, 90.0% hardirq, 0.0% idle
[ 67.714555] #4: 0.0% system, 0.0% softirq, 95.0% hardirq, 0.0% idle
[ 67.714556] #5: 0.0% system, 0.0% softirq, 90.0% hardirq, 0.0% idle
[ 67.714570] CPU#9 Detect HardIRQ Time exceeds 50% since 45s. Most time-consuming HardIRQs:
[ 67.714571] #1: 99.9% irq#7(IPI)
[ 67.714575] #2: 0.0% irq#10(arch_timer)
[ 67.714578] #3: 0.0% irq#2(IPI)
...
[ 67.714654] Call trace:
[ 67.714656] __do_softirq+0xa8/0x364

Signed-off-by: Bitao Hu <yaoma@xxxxxxxxxxxxxxxxx>
---
include/linux/irq.h | 9 ++
include/linux/irqdesc.h | 2 +
kernel/irq/irqdesc.c | 9 +-
kernel/watchdog.c | 186 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 205 insertions(+), 1 deletion(-)

Just like on patch #1, I'd overall like to say that I'm very excited
about this patch and think it'll help track down a lot of issues, so
please continue moving forward with it!
Sure, I will continue moving forward with it.

That being said, my gut says that this can be a lot simpler / lower
overhead if we just use the existing IRQ counts that the framework is
already keeping track of for us. We could look for the IRQs that had
the most counts on the soft locked CPU over the period of time.

It might not be quite as good/accurate compared to what you have here
but it seems like it would be "good enough". Normally when there's an
IRQ storm that's so bad that it's causing a softlockup then each IRQ
isn't actually that slow but they're going off nonstop. It would catch
that case perfectly.

I guess maybe (?) the case it wouldn't catch so well would be if you
had a loop that looked like:

* CPU from "storming" device takes ~10 ms and finishes.
* After the "storming" device finishes, we somehow manage to service a
whole pile of normal interrupts from non-storming devices.
* After the non-storming interrupts finished, then we went back to
servicing the slow storming interrupt.
I think there are two issues with using IRQ counts in determining interrupt storm:

1. we might not know the normal frequency for each interrupt, witch making the IRQ counts inconvenient for comparison.

For instance, in your example, suppose the slow interrupt takes 10ms at a frequency of 100Hz; the normal interrupt takes 10us at a frequency of 500Hz. If we use IRQ counts, we might incorrectly conclude that it is the normal interrupt causing softlockup.

2. During an interrupt storm, the IRQ counts recorded by the kernel does not necessarily increase.
The softlockup caused by an NVMe interrupt storm that I encountered is a case in point. Around the time of the softlockup, the NVMe interrupt frequency actually decreased from ~300 kHz to ~17 kHz. However, the average time taken per NVMe interrupt increased from ~2us to ~60us. This is because the nvme_irq handler is capable of polling and processing multiple interrupts reported by the NVMe controller, which are not captured in the kernel's interrupt statistics. The NVMe driver handles interrupts frequency is ~660 kHz actually.

Using interrupt time as a criterion of interrupt storm can avoid the above two points.


I haven't dug into the bowels of the Linux IRQ handling well enough to
know if that would be possible. However, even if it was possible it
doesn't feel likely to happen. If those "normal" interrupts aren't
storming then it seems unlikely we'd see more than one of each of them
between the "storming" interrupts. Thus it feels like the "storming"
interrupt would still have the highest (or tied for the highest) count
on the locked up CPU. Maybe you could print the top 4 or 5 (instead of
the top 3) and you'd be pretty certain to catch even this case?

In any case, I'll wait before doing a more thorough review for now and
hope the above sounds right to you.

-Doug