[PATCHv7 0/2] *** Detect interrupt storm in softlockup ***
From: Bitao Hu
Date: Tue Feb 13 2024 - 21:14:55 EST
Hi, guys.
I have implemented a low-overhead method for detecting interrupt
storms as the cause of a softlockup. Please review it; all comments
are welcome.
Changes from v6 to v7:
- Remove "READ_ONCE" in "start_counting_irqs"
- Replace the hard-coded 5 with "NUM_SAMPLE_PERIODS" macro in
"set_sample_period".
- Add empty lines to help with reading the code.
- Remove the branch that processes IRQs where "counts_diff = 0".
- Add Reviewed-by tags from Liu Song and Douglas.
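The NUM_SAMPLE_PERIODS change can be pictured with a small userspace sketch. It assumes the sample period is the softlockup threshold split evenly into NUM_SAMPLE_PERIODS pieces (5, the previously hard-coded value). `sample_period_ns()` and the local NSEC_PER_SEC definition are hypothetical stand-ins, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants for this sketch. */
#define NSEC_PER_SEC       1000000000ULL
#define NUM_SAMPLE_PERIODS 5 /* samples taken per softlockup threshold */

/*
 * Split the softlockup threshold (in seconds) into NUM_SAMPLE_PERIODS
 * equal sample periods, returned in nanoseconds. A named divisor stays
 * in sync with anything else sized by NUM_SAMPLE_PERIODS, such as a
 * ring buffer holding one utilization sample per period.
 */
static uint64_t sample_period_ns(unsigned int softlockup_thresh_s)
{
	assert(softlockup_thresh_s > 0);
	return softlockup_thresh_s * (NSEC_PER_SEC / NUM_SAMPLE_PERIODS);
}
```

With a 20-second softlockup threshold (twice the default watchdog_thresh of 10), `sample_period_ns(20)` yields 4,000,000,000 ns, i.e. one sample every 4 seconds.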
Changes from v5 to v6:
- Run "./scripts/checkpatch.pl --strict" to catch a few extra
style nits, and fix them.
- Squash patch #3 into patch #1, and wrap the help text to
80 columns.
- Sort existing headers alphabetically in watchdog.c
- Drop "softlockup_hardirq_cpus"; instead, just check whether
"hardirq_counts" is non-NULL.
- Store "nr_irqs" in a local variable.
- Simplify the calculation of "cpu_diff".
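The idea of letting a non-NULL "hardirq_counts" act as the "counting in progress" state, together with reading nr_irqs once into a local, might look like the userspace sketch below. Everything here is hypothetical: `sim_nr_irqs` stands in for the kernel's nr_irqs, and `struct cpu_irq_state`, `start_counting()` and `stop_counting()` are illustrative names, not the patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the kernel's nr_irqs, which can grow at runtime. */
static int sim_nr_irqs;

/*
 * Hypothetical per-CPU state. A non-NULL hardirq_counts doubles as the
 * "counting in progress" flag, so no separate CPU mask has to be kept
 * in sync with it.
 */
struct cpu_irq_state {
	unsigned int *hardirq_counts; /* snapshot taken when counting starts */
	int nr_snapshot;              /* entries in hardirq_counts */
};

static bool is_counting(const struct cpu_irq_state *st)
{
	return st->hardirq_counts != NULL;
}

/* counts must hold at least sim_nr_irqs entries. */
static bool start_counting(struct cpu_irq_state *st, const unsigned int *counts)
{
	/* Read the global once so the allocation and the copy always agree. */
	int nr_irqs = sim_nr_irqs;

	assert(st && counts);
	if (is_counting(st))
		return true; /* keep the existing snapshot */
	if (nr_irqs <= 0)
		return false;

	st->hardirq_counts = malloc(nr_irqs * sizeof(*st->hardirq_counts));
	if (!st->hardirq_counts)
		return false;
	memcpy(st->hardirq_counts, counts, nr_irqs * sizeof(*counts));
	st->nr_snapshot = nr_irqs;
	return true;
}

static void stop_counting(struct cpu_irq_state *st)
{
	free(st->hardirq_counts);
	st->hardirq_counts = NULL; /* back to "not counting" */
	st->nr_snapshot = 0;
}
```

Because the pointer itself is the state, there is no second flag that could disagree with it.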
Changes from v4 to v5:
- Rearrange variable placement to make the code look neater.
Changes from v3 to v4:
- Rename some variables and functions to make the code logic more
readable.
- Move code around to avoid forward declarations.
- Use swaps instead of a double loop in tabulate_irq_count.
- Since nr_irqs can grow at runtime, add bounds-checking logic.
- Add SOFTLOCKUP_DETECTOR_INTR_STORM Kconfig knob.
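A swap-based top-N tabulation and the nr_irqs bounds check can be sketched as follows. This is an illustration under assumptions, not the patch's tabulate_irq_count: the fields of `struct irq_counts`, NUM_HARDIRQ_REPORT, `tabulate_top_irq()` and `find_top_irqs()` are all hypothetical.

```c
#include <assert.h>

#define NUM_HARDIRQ_REPORT 5 /* hypothetical: how many top IRQs to report */

struct irq_counts {
	int irq;             /* assumed field: IRQ number, -1 if unused */
	unsigned int counts; /* assumed field: interrupts since snapshot */
};

static void swap_irq_counts(struct irq_counts *a, struct irq_counts *b)
{
	struct irq_counts tmp = *a;

	*a = *b;
	*b = tmp;
}

/*
 * Insert (irq, counts) into top[], kept in descending order. One pass
 * that swaps the candidate with every smaller entry bubbles it into
 * place and shifts the rest down; the smallest entry falls off the end.
 */
static void tabulate_top_irq(struct irq_counts *top, int irq,
			     unsigned int counts, int rank)
{
	struct irq_counts cand = { irq, counts };
	int i;

	assert(rank > 0);
	for (i = 0; i < rank; i++) {
		if (cand.counts > top[i].counts)
			swap_irq_counts(&cand, &top[i]);
	}
}

/*
 * nr_now may exceed nr_snap if IRQs were added after the snapshot, so
 * only indices present in both arrays are compared. IRQs with a zero
 * delta never beat an empty slot, so they need no special branch.
 */
static void find_top_irqs(const unsigned int *snap, int nr_snap,
			  const unsigned int *now, int nr_now,
			  struct irq_counts *top)
{
	int nr = nr_snap < nr_now ? nr_snap : nr_now; /* bounds check */
	int i;

	for (i = 0; i < NUM_HARDIRQ_REPORT; i++)
		top[i] = (struct irq_counts){ -1, 0 };

	for (i = 0; i < nr; i++)
		tabulate_top_irq(top, i, now[i] - snap[i], NUM_HARDIRQ_REPORT);
}
```

Each insertion costs at most NUM_HARDIRQ_REPORT comparisons, so a full scan is linear in the number of IRQs.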
Changes from v2 to v3:
- From Liu Song: use an enum instead of macros for cpu_stats, shorten
the name 'idx_to_stat' to 'stats', add 'get_16bit_precesion' instead
of open-coded right shifts, and use 'struct irq_counts'.
- From the kernel test robot: use '__this_cpu_read' and
'__this_cpu_write' instead of accessing a per-cpu array directly, to
avoid this warning:
'sparse: incorrect type in initializer (different modifiers)'
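A userspace sketch of an enum-indexed statistic set and a named 16-bit scaling helper follows. The enum members, `to_16bit_units()` and `util_percent()` are hypothetical. The 24-bit shift (one unit = 2^24 ns, about 16.8 ms) is an assumption about what a helper like 'get_16bit_precesion' might do, not a quote of it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tracked statistics, indexed by an enum rather than macros. */
enum cpu_stat_idx {
	STAT_SYSTEM,
	STAT_SOFTIRQ,
	STAT_HARDIRQ,
	STAT_IDLE,
	NUM_STATS,
};

/*
 * Assumed scaling: drop the low 24 bits of a nanosecond value, so one
 * unit is 2^24 ns (about 16.8 ms). A named helper documents the intent
 * better than a bare ">> 24" repeated through the code.
 */
static uint16_t to_16bit_units(uint64_t ns)
{
	return (uint16_t)(ns >> 24);
}

/*
 * Percentage of a sample period spent in one state. Casting the delta
 * back to uint16_t keeps it correct when the 16-bit counter wraps.
 */
static unsigned int util_percent(uint16_t old_units, uint16_t new_units,
				 uint16_t period_units)
{
	uint16_t delta = (uint16_t)(new_units - old_units);

	assert(period_units > 0);
	return 100u * delta / period_units;
}
```

For example, a 4-second period is 238 units, and a counter that wraps from 65530 to 10 still yields a delta of 16.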
Changes from v1 to v2:
- From Douglas: reduce the memory used by cpustats. With the maximum
number of CPUs (8192), it now takes:
2 * 8192 * 4 + 1 * 8192 * 5 * 4 + 1 * 8192 = 237,568 bytes.
- From Liu Song: refactor the code formatting and add necessary
comments.
- From Douglas: use interrupt counts instead of interrupt time to
determine the cause of a softlockup.
- Remove the cmdline parameter added in PATCHv1.
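The cpustats memory estimate from the v1 to v2 changes can be checked with a tiny program. It only reproduces the three terms as written and makes no assumption about what each term stores; `cpustats_bytes()` and MAX_CPUS are hypothetical names. Equivalently, the layout costs 29 bytes per CPU (8 + 20 + 1), and 29 * 8192 = 237,568.

```c
#include <assert.h>

#define MAX_CPUS 8192UL /* CPU count used in the estimate */

/* The three terms of the estimate, exactly as written in the changelog. */
static unsigned long cpustats_bytes(unsigned long ncpus)
{
	unsigned long term1 = 2 * ncpus * 4;     /* 65,536 bytes for 8192 CPUs */
	unsigned long term2 = 1 * ncpus * 5 * 4; /* 163,840 bytes for 8192 CPUs */
	unsigned long term3 = 1 * ncpus;         /* 8,192 bytes for 8192 CPUs */

	assert(ncpus > 0);
	return term1 + term2 + term3;
}
```

`cpustats_bytes(MAX_CPUS)` returns 237568, matching the figure above.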
Bitao Hu (2):
watchdog/softlockup: low-overhead detection of interrupt
watchdog/softlockup: report the most frequent interrupts
kernel/watchdog.c | 255 +++++++++++++++++++++++++++++++++++++++++++++-
lib/Kconfig.debug | 13 +++
2 files changed, 263 insertions(+), 5 deletions(-)
--
2.37.1 (Apple Git-137.1)