Re: [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm

From: Doug Anderson
Date: Wed Jan 24 2024 - 19:20:03 EST


Hi,

On Tue, Jan 23, 2024 at 4:12 AM Bitao Hu <yaoma@xxxxxxxxxxxxxxxxx> wrote:
>
> The following softlockup is caused by an interrupt storm, but it cannot
> be identified from the call tree, because the call tree is just a
> snapshot and doesn't fully capture the behavior of the CPU during the
> soft lockup.
> watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
> ...
> Call trace:
> __do_softirq+0xa0/0x37c
> __irq_exit_rcu+0x108/0x140
> irq_exit+0x14/0x20
> __handle_domain_irq+0x84/0xe0
> gic_handle_irq+0x80/0x108
> el0_irq_naked+0x50/0x58
>
> Therefore, I think it is necessary to report CPU utilization during the
> softlockup_thresh period (reported once every sample_period, for a total
> of 5 reports), like this:
> watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
> CPU#28 Utilization every 4s during lockup:
> #1: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
> #2: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
> #3: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
> #4: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
> #5: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
> ...
>
> This would be helpful in determining whether an interrupt storm has
> occurred or in identifying the cause of the softlockup. The criteria for
> determination are as follows:
> a. If the hardirq utilization is high, then an interrupt storm should be
> considered and the root cause cannot be determined from the call tree.
> b. If the softirq utilization is high, then we could analyze the call
> tree, but it may not reflect the root cause.
> c. If the system utilization is high, then we could analyze the root
> cause from the call tree.

Just to set the tone: this sounds like a great idea! I've found myself
wanting something like this for a long time but I've never sat down to
try to code up a mechanism. I have review comments below, but mostly
it's just details. IMO the idea is definitely something we want to
land.


> Signed-off-by: Bitao Hu <yaoma@xxxxxxxxxxxxxxxxx>
> ---
> kernel/watchdog.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 58 insertions(+)
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 81a8862295d6..9fad10e0a147 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -23,6 +23,8 @@
> #include <linux/sched/debug.h>
> #include <linux/sched/isolation.h>
> #include <linux/stop_machine.h>
> +#include <linux/kernel_stat.h>
> +#include <linux/math64.h>
>
> #include <asm/irq_regs.h>
> #include <linux/kvm_para.h>
> @@ -441,6 +443,58 @@ static int is_softlockup(unsigned long touch_ts,
> return 0;
> }
>
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +static DEFINE_PER_CPU(u64, cpustat_old[NR_STATS]);
> +static DEFINE_PER_CPU(u64, cpustat_diff[5][NR_STATS]);
> +static DEFINE_PER_CPU(int, cpustat_tail);

The above is potentially a lot of memory. On some configs we allow up
to 8192 CPUs and you're storing (NR_STATS * 6) u64s per CPU plus
another int per CPU. NR_STATS can be up to 11 currently, so if I did
my math right then that's

8 * 8192 * 11 * 6 + 4 * 8192 = 4,358,144 bytes (!) That's probably not OK.

In theory you only need 4 stats (plus the total) instead of 11. If you
do that plus the "old" then that would give you (8 * 8192 * 5 * 6 + 4
* 8192) = 1,998,848 bytes. That's still a lot, but at least less.
...so at the very least you should only store the fields you need.

Given the amount of potential space at stake, it's probably worth
optimizing this to something smaller than u64 as well. It feels like
it would be easy to use u32. You only need diffs here and it doesn't
seem like it would be hard to figure out how to use 32-bits, even if
you have to drop a bit of granularity.

It feels like you could even go down to 8-bits per stat, though. You
don't _really_ need such accurate percentages, right? Does it really
matter if you have 99.4%? It seems like just saying 99% would be fine.
Just do the math here and store the integral percentage per stat you
care about. The numbers 0 - 100 can be stored in 8 bits.
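For instance, 3.6s of hardirq time in a 4 second sample period is just
100 * 3600000 / 4000000 = 90.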

Combining that all together, I guess you could do this (untested; a
rough sketch of the resulting update function follows the size math
below):

1. Get the current stats and right shift them all by 10 and convert
them to 32-bit. This gives you (roughly) microseconds which will roll
over roughly every ~4000 seconds. That seems about right, though I
guess it's possible to do more and see if we can squeeze into 16 bits.

2. Sum up all 11 of the stats to get a total. This should still fit
into 32-bits.

3. Do the math to get the integral percentages and store those in the tail slot.

4. Store the 4 stats you care about plus the total (all 32-bits) in
"cpustat_old".

If I've got that right, your definitions should be able to be:

#define NUM_OLD_STATS_GROUPS 5
#define NUM_STATS_PER_GROUP 4
static DEFINE_PER_CPU(u32, cpustat_old[NUM_STATS_PER_GROUP + 1]);
static DEFINE_PER_CPU(u8, cpustat_utilization[NUM_OLD_STATS_GROUPS][NUM_STATS_PER_GROUP]);
static DEFINE_PER_CPU(u8, cpustat_tail);

With the maximum number of CPUs, that's now this, if I got my math right.

4 * 8192 * 5 + 1 * 8192 * 5 * 4 + 1 * 8192 = 335,872 bytes.

That's a lot less, but still a lot. I'd be interested to hear other
opinions, but it's probably worth a Kconfig knob.
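
To make steps 1-4 concrete, here's a rough sketch of what
update_cpustat() could look like with those definitions. This is
untested, the tracked_stats[] array is just one way of picking out the
four stats you print, and I'm leaning on div_u64() (which the patch
already includes via linux/math64.h) to avoid overflow worries in the
percentage math:

static const enum cpu_usage_stat tracked_stats[NUM_STATS_PER_GROUP] = {
	CPUTIME_SYSTEM, CPUTIME_SOFTIRQ, CPUTIME_IRQ, CPUTIME_IDLE,
};

static void update_cpustat(void)
{
	u32 *old = this_cpu_ptr(cpustat_old);
	u8 (*utilization)[NUM_STATS_PER_GROUP] = this_cpu_ptr(cpustat_utilization);
	u8 tail = this_cpu_read(cpustat_tail);
	struct kernel_cpustat kcpustat;
	u64 *cpustat = kcpustat.cpustat;
	u32 total = 0, period;
	int i;

	kcpustat_cpu_fetch(&kcpustat, smp_processor_id());

	/* Steps 1 + 2: ns >> 10 is roughly us and fits in 32 bits. */
	for (i = 0; i < NR_STATS; i++)
		total += (u32)(cpustat[i] >> 10);
	period = total - old[NUM_STATS_PER_GROUP];

	for (i = 0; i < NUM_STATS_PER_GROUP; i++) {
		u32 curr = (u32)(cpustat[tracked_stats[i]] >> 10);

		/* Step 3: store the integral percentage in 8 bits. */
		utilization[tail][i] = period ?
			div_u64((u64)(curr - old[i]) * 100, period) : 0;
		/* Step 4: save this reading for the next diff. */
		old[i] = curr;
	}
	old[NUM_STATS_PER_GROUP] = total;
	this_cpu_write(cpustat_tail, (tail + 1) % NUM_OLD_STATS_GROUPS);
}

print_cpustat() would then just walk cpustat_utilization[] starting at
the tail and print whole percentages.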


> +static void update_cpustat(void)
> +{
> + u64 *old = this_cpu_ptr(cpustat_old);
> + u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
> + int tail = this_cpu_read(cpustat_tail), i;

nit: Please define "i" on its own line. It looks weird the way you have it here.

> + struct kernel_cpustat kcpustat;
> + u64 *cpustat = kcpustat.cpustat;
> +
> + kcpustat_cpu_fetch(&kcpustat, smp_processor_id());
> + for (i = 0; i < NR_STATS; i++) {
> + diff[tail][i] = cpustat[i] - old[i];
> + old[i] = cpustat[i];
> + }
> + this_cpu_write(cpustat_tail, (tail + 1) % 5);
> +}
> +
> +static void print_cpustat(void)
> +{
> + int i, j, k;
> + u64 a[5][NR_STATS], b[5][NR_STATS];
> + u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
> + int tail = this_cpu_read(cpustat_tail);
> + u32 period_us = sample_period / 1000;
> +
> + for (i = 0; i < 5; i++) {
> + for (j = 0; j < NR_STATS; j++) {
> + a[i][j] = 100 * (diff[i][j] / 1000);
> + b[i][j] = 10 * do_div(a[i][j], period_us);
> + do_div(b[i][j], period_us);
> + }
> + }
> + printk(KERN_CRIT "CPU#%d Utilization every %us during lockup:\n",
> + smp_processor_id(), period_us/1000000);
> + for (k = 0, i = tail; k < 5; k++, i = (i + 1) % 5) {
> + printk(KERN_CRIT "\t#%d: %llu.%llu%% system,\t%llu.%llu%% softirq,\t"
> + "%llu.%llu%% hardirq,\t%llu.%llu%% idle\n", k+1,
> + a[i][CPUTIME_SYSTEM], b[i][CPUTIME_SYSTEM],
> + a[i][CPUTIME_SOFTIRQ], b[i][CPUTIME_SOFTIRQ],
> + a[i][CPUTIME_IRQ], b[i][CPUTIME_IRQ],
> + a[i][CPUTIME_IDLE], b[i][CPUTIME_IDLE]);

As per your response to Liu Song, I understand why you're using
"printk" instead of pr_crit(), but it deserves a comment.