Re: [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm

From: Bitao Hu
Date: Thu Jan 25 2024 - 03:31:26 EST

On 2024/1/25 08:19, Doug Anderson wrote:
Hi,

On Tue, Jan 23, 2024 at 4:12 AM Bitao Hu <yaoma@xxxxxxxxxxxxxxxxx> wrote:

The following softlockup is caused by an interrupt storm, but it cannot
be identified from the call tree, because the call tree is just a
snapshot and doesn't fully capture the behavior of the CPU during the
soft lockup.
watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
...
Call trace:
__do_softirq+0xa0/0x37c
__irq_exit_rcu+0x108/0x140
irq_exit+0x14/0x20
__handle_domain_irq+0x84/0xe0
gic_handle_irq+0x80/0x108
el0_irq_naked+0x50/0x58

Therefore, I think it is necessary to report CPU utilization during the
softlockup_thresh period (reporting once every sample_period, five times
in total), like this:
watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
CPU#28 Utilization every 4s during lockup:
#1: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
#2: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
#3: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
#4: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
#5: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
...

This would be helpful in determining whether an interrupt storm has
occurred or in identifying the cause of the softlockup. The criteria for
determination are as follows:
a. If the hardirq utilization is high, then an interrupt storm should
be suspected, and the root cause cannot be determined from the call
tree.
b. If the softirq utilization is high, then we can analyze the call
tree, but it may not reflect the root cause.
c. If the system utilization is high, then we can analyze the root
cause from the call tree.

Just to set the tone: this sounds like a great idea! I've found myself
wanting something like this for a long time but I've never sat down to
try to code up a mechanism. I have review comments below, but mostly
it's just details. IMO the idea is definitely something we want to
land.

Thanks for your support of this idea.


Signed-off-by: Bitao Hu <yaoma@xxxxxxxxxxxxxxxxx>
---
kernel/watchdog.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 81a8862295d6..9fad10e0a147 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -23,6 +23,8 @@
#include <linux/sched/debug.h>
#include <linux/sched/isolation.h>
#include <linux/stop_machine.h>
+#include <linux/kernel_stat.h>
+#include <linux/math64.h>

#include <asm/irq_regs.h>
#include <linux/kvm_para.h>
@@ -441,6 +443,58 @@ static int is_softlockup(unsigned long touch_ts,
return 0;
}

+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static DEFINE_PER_CPU(u64, cpustat_old[NR_STATS]);
+static DEFINE_PER_CPU(u64, cpustat_diff[5][NR_STATS]);
+static DEFINE_PER_CPU(int, cpustat_tail);

The above is potentially a lot of memory. On some configs we allow up
to 8192 CPUs and you're storing (NR_STATS * 6) u64s per CPU plus
another int per CPU. NR_STATS can be up to 11 currently, so if I did
my math right then that's

8 * 8192 * 11 * 6 + 4 * 8192 = 4,358,144 bytes (!) That's probably not OK.

In theory you only need 4 stats (plus the total) instead of 11. If you
do that plus the "old" then that would give you (8 * 8192 * 5 * 6 + 4
* 8192) = 1,998,848 bytes. That's still a lot, but at least less.
...so at the very least you should only store the fields you need.

Given the amount of potential space at stake, it's probably worth
optimizing this to something smaller than u64 as well. It feels like
it would be easy to use u32. You only need diffs here and it doesn't
seem like it would be hard to figure out how to use 32-bits, even if
you have to drop a bit of granularity.

It feels like you could even go down to 8-bits per stat, though. You
don't _really_ need such accurate percentages, right? Does it really
matter if you have 99.4%? It seems like just saying 99% would be fine.
Just do the math here and store the integral percentage per stat you
care about. The numbers 0 - 100 can be stored in 8 bits.
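
Roughly, I'm imagining an untested helper along these lines (the name
utilization_percent() is made up; div64_u64() comes from
<linux/math64.h>, which the patch already includes):

/*
 * Untested sketch: turn one cputime delta (in ns) into an integral
 * percentage of the sample period. Values 0-100 fit easily in a u8.
 */
static u8 utilization_percent(u64 delta_ns, u64 period_ns)
{
	/* delta_ns is a few seconds at most, so * 100 can't overflow u64 */
	return (u8)div64_u64(delta_ns * 100, period_ns);
}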

Thanks for your detailed analysis and the suggested memory-consumption
optimization. I indeed hadn't considered this point, and I will update
it in v2.

Combining that all together, I guess you could do this (untested):

1. Get the current stats and right shift them all by 10 and convert
them to 32-bit. This gives you (roughly) microseconds which will roll
over roughly every ~4000 seconds. That seems about right, though I
guess it's possible to shift further and see if we can squeeze into 16 bits.

2. Sum up all 11 of the stats to get a total. This should still fit
into 32-bits.


3. Do the math to get the integral percentages and store those in the tail slot.

I'm a bit confused: why is there a need to sum up all 11 of the stats
to get a total? I calculate the percentage using sample_period.

4. Store the 4 stats you care about plus the total (all 32-bits) in
"cpustat_old".

If I've got that right, your definitions could look like this:

#define NUM_OLD_STATS_GROUPS 5
#define NUM_STATS_PER_GROUP 4
static DEFINE_PER_CPU(u32, cpustat_old[NUM_STATS_PER_GROUP + 1]);
static DEFINE_PER_CPU(u8, cpustat_utilization[NUM_OLD_STATS_GROUPS][NUM_STATS_PER_GROUP]);
static DEFINE_PER_CPU(u8, cpustat_tail);

With the maximum number of CPUs, that's now the following, if I got my math right:

4 * 8192 * 5 + 1 * 8192 * 5 * 4 + 1 * 8192 = 335,872 bytes.

That's a lot less, but still a lot. I'd be interested to hear other
opinions, but it's probably worth a Kconfig knob.
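
Putting the pieces together, an untested sketch of the update side with
that layout could look like the below (the tracked[] table and the
guard against a zero elapsed total are made-up details, not from the
patch):

/* Untested sketch of the scheme described above. */
static const enum cpu_usage_stat tracked[NUM_STATS_PER_GROUP] = {
	CPUTIME_SYSTEM, CPUTIME_SOFTIRQ, CPUTIME_IRQ, CPUTIME_IDLE,
};

static void update_cpustat(void)
{
	u32 *old = this_cpu_ptr(cpustat_old);
	u8 (*util)[NUM_STATS_PER_GROUP] = this_cpu_ptr(cpustat_utilization);
	int tail = this_cpu_read(cpustat_tail);
	struct kernel_cpustat kcpustat;
	u64 *cpustat = kcpustat.cpustat;
	u32 total = 0, period, now;
	int i;

	kcpustat_cpu_fetch(&kcpustat, smp_processor_id());

	/* Steps 1+2: >>10 gives roughly microseconds; the sum fits in u32. */
	for (i = 0; i < NR_STATS; i++)
		total += (u32)(cpustat[i] >> 10);

	/* Steps 3+4: integral percentage of each delta vs. elapsed total. */
	period = total - old[NUM_STATS_PER_GROUP];
	for (i = 0; i < NUM_STATS_PER_GROUP; i++) {
		now = (u32)(cpustat[tracked[i]] >> 10);
		util[tail][i] = period ? 100 * (now - old[i]) / period : 0;
		old[i] = now;
	}
	old[NUM_STATS_PER_GROUP] = total;

	this_cpu_write(cpustat_tail, (tail + 1) % NUM_OLD_STATS_GROUPS);
}

Dividing by the change in the total is the reason for summing up all 11
stats in step 2; dividing by sample_period, as the patch does, would
work just as well if the period is reliable.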


+static void update_cpustat(void)
+{
+ u64 *old = this_cpu_ptr(cpustat_old);
+ u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
+ int tail = this_cpu_read(cpustat_tail), i;

nit: Please define "i" on its own line. It looks weird the way you have it here.

Sure.

+ struct kernel_cpustat kcpustat;
+ u64 *cpustat = kcpustat.cpustat;
+
+ kcpustat_cpu_fetch(&kcpustat, smp_processor_id());
+ for (i = 0; i < NR_STATS; i++) {
+ diff[tail][i] = cpustat[i] - old[i];
+ old[i] = cpustat[i];
+ }
+ this_cpu_write(cpustat_tail, (tail + 1) % 5);
+}
+
+static void print_cpustat(void)
+{
+ int i, j, k;
+ u64 a[5][NR_STATS], b[5][NR_STATS];
+ u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
+ int tail = this_cpu_read(cpustat_tail);
+ u32 period_us = sample_period / 1000;
+
+ for (i = 0; i < 5; i++) {
+ for (j = 0; j < NR_STATS; j++) {
+ a[i][j] = 100 * (diff[i][j] / 1000);
+ b[i][j] = 10 * do_div(a[i][j], period_us);
+ do_div(b[i][j], period_us);
+ }
+ }
+ printk(KERN_CRIT "CPU#%d Utilization every %us during lockup:\n",
+ smp_processor_id(), period_us/1000000);
+ for (k = 0, i = tail; k < 5; k++, i = (i + 1) % 5) {
+ printk(KERN_CRIT "\t#%d: %llu.%llu%% system,\t%llu.%llu%% softirq,\t"
+ "%llu.%llu%% hardirq,\t%llu.%llu%% idle\n", k+1,
+ a[i][CPUTIME_SYSTEM], b[i][CPUTIME_SYSTEM],
+ a[i][CPUTIME_SOFTIRQ], b[i][CPUTIME_SOFTIRQ],
+ a[i][CPUTIME_IRQ], b[i][CPUTIME_IRQ],
+ a[i][CPUTIME_IDLE], b[i][CPUTIME_IDLE]);

As per your response to Liu Song, I understand why you're using
"printk" instead of pr_crit(), but it deserves a comment.

Sure, I will add a comment here.
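
For reference, the fixed-point math in print_cpustat() works out like
this (a worked example, assuming sample_period = 4s, i.e. period_us =
4000000):

/*
 * diff = 1234567890 ns, i.e. 1.23456789s busy out of the 4s window:
 *
 *   a = 100 * (1234567890 / 1000) = 123456700
 *   do_div(a, 4000000)  ->  a = 30, remainder 3456700
 *   b = 10 * 3456700 = 34567000
 *   do_div(b, 4000000)  ->  b = 8
 *
 * which prints as "30.8%" (1.23456789 / 4 = 0.3086...).
 */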