Re: [PATCHv6 1/2] watchdog/softlockup: low-overhead detection of interrupt

From: Petr Mladek
Date: Fri Feb 09 2024 - 08:35:43 EST


Hi,

I am sorry for joining this game so late. But honestly, it went
forward too quickly. A good practice is to wait a week before
sending a new version so that more people get a chance to
provide some feedback.

The only exception might be when you know exactly who will
review it because the area is not interesting for anyone else.
But this is typically not the case for kernel core code.


On Thu 2024-02-08 20:54:25, Bitao Hu wrote:
> The following softlockup is caused by an interrupt storm, but it cannot
> be identified from the call tree, because the call tree is just a
> snapshot and doesn't fully capture the behavior of the CPU during the
> soft lockup.
> watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
> ...
> Call trace:
> __do_softirq+0xa0/0x37c
> __irq_exit_rcu+0x108/0x140
> irq_exit+0x14/0x20
> __handle_domain_irq+0x84/0xe0
> gic_handle_irq+0x80/0x108
> el0_irq_naked+0x50/0x58
>
> Therefore, I think it is necessary to report CPU utilization during the
> softlockup_thresh period (one report every sample_period, for a total
> of 5 reports), like this:
> watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
> CPU#28 Utilization every 4s during lockup:
> #1: 0% system, 0% softirq, 100% hardirq, 0% idle
> #2: 0% system, 0% softirq, 100% hardirq, 0% idle
> #3: 0% system, 0% softirq, 100% hardirq, 0% idle
> #4: 0% system, 0% softirq, 100% hardirq, 0% idle
> #5: 0% system, 0% softirq, 100% hardirq, 0% idle

I like this. IMHO, it might be really useful.

> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -333,6 +335,92 @@ __setup("watchdog_thresh=", watchdog_thresh_setup);
>
> static void __lockup_detector_cleanup(void);
>
> +#ifdef CONFIG_SOFTLOCKUP_DETECTOR_INTR_STORM
> +#define NUM_STATS_GROUPS 5

It would be nice to synchronize this with the hardcoded 5 in:

static void set_sample_period(void)
{
	/*
	 * convert watchdog_thresh from seconds to ns
	 * the divide by 5 is to give hrtimer several chances (two
	 * or three with the current relation between the soft
	 * and hard thresholds) to increment before the
	 * hardlockup detector generates a warning
	 */
	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);

For example, define and use the following in both places:

#define NUM_SAMPLE_PERIODS 5
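
Just to show what I mean (only a sketch, the name is only a
suggestion and does not exist in the current code):

static void set_sample_period(void)
{
	/*
	 * convert watchdog_thresh from seconds to ns
	 * the divide gives hrtimer several chances to increment
	 * before the hardlockup detector generates a warning
	 */
	sample_period = get_softlockup_thresh() *
			((u64)NSEC_PER_SEC / NUM_SAMPLE_PERIODS);
}

and in this patch:

static DEFINE_PER_CPU(u8, cpustat_util[NUM_SAMPLE_PERIODS][NUM_STATS_PER_GROUP]);

so that the number of stored/reported groups can never get out of
sync with the number of sample periods per soft lockup report.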

> +enum stats_per_group {
> + STATS_SYSTEM,
> + STATS_SOFTIRQ,
> + STATS_HARDIRQ,
> + STATS_IDLE,
> + NUM_STATS_PER_GROUP,
> +};
> +
> +static const enum cpu_usage_stat tracked_stats[NUM_STATS_PER_GROUP] = {
> + CPUTIME_SYSTEM,
> + CPUTIME_SOFTIRQ,
> + CPUTIME_IRQ,
> + CPUTIME_IDLE,
> +};
> +
> +static DEFINE_PER_CPU(u16, cpustat_old[NUM_STATS_PER_GROUP]);
> +static DEFINE_PER_CPU(u8, cpustat_util[NUM_STATS_GROUPS][NUM_STATS_PER_GROUP]);
> +static DEFINE_PER_CPU(u8, cpustat_tail);
> +
> +/*
> + * We don't need nanosecond resolution. A granularity of 16ms is
> + * sufficient for our precision, allowing us to use u16 to store
> + * cpustats, which will roll over roughly every ~1000 seconds.
> + * 2^24 ~= 16 * 10^6
> + */
> +static u16 get_16bit_precision(u64 data_ns)
> +{
> + return data_ns >> 24LL; /* 2^24ns ~= 16.8ms */

I would personally use

delta_ns >> 20 /* 2^20ns ~= 1ms */

to make it easier for a human to debug. It would support sample
periods of up to ~65s, which might be enough.

But I do not insist on it. ">> 24" provides coarser granularity
but supports longer sample periods.
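
I mean something like this (again, just a sketch):

/*
 * A granularity of ~1ms is enough for this report. With a u16 it
 * still covers deltas up to 2^36 ns, i.e. sample periods of roughly
 * a minute.
 */
static u16 get_16bit_precision(u64 delta_ns)
{
	return delta_ns >> 20;	/* 2^20ns ~= 1ms */
}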

> +static void print_cpustat(void)
> +{
> + int i, group;
> + u8 tail = __this_cpu_read(cpustat_tail);
> + u64 sample_period_second = sample_period;
> +
> + do_div(sample_period_second, NSEC_PER_SEC);
> + /*
> + * We do not want the "watchdog: " prefix on every line,
> + * hence we use "printk" instead of "pr_crit".
> + */
> + printk(KERN_CRIT "CPU#%d Utilization every %llus during lockup:\n",
> + smp_processor_id(), sample_period_second);
> + for (i = 0; i < NUM_STATS_GROUPS; i++) {

This starts with the 1st group in the array. Is it the oldest one?
It should take into account cpustat_tail.


> + group = (tail + i) % NUM_STATS_GROUPS;
> + printk(KERN_CRIT "\t#%d: %3u%% system,\t%3u%% softirq,\t"
> + "%3u%% hardirq,\t%3u%% idle\n", i + 1,
> + __this_cpu_read(cpustat_util[group][STATS_SYSTEM]),
> + __this_cpu_read(cpustat_util[group][STATS_SOFTIRQ]),
> + __this_cpu_read(cpustat_util[group][STATS_HARDIRQ]),
> + __this_cpu_read(cpustat_util[group][STATS_IDLE]));
> + }
> +}
> +
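
Just to make the question more concrete: the printed order is well
defined only when the update side stores the newest sample at the
current tail and then advances it, e.g. something like this (a sketch
only, the helper name is made up and not taken from the patch):

static void store_cpustat_group(const u8 util[NUM_STATS_PER_GROUP])
{
	int i;
	u8 tail = __this_cpu_read(cpustat_tail);

	/* Overwrite the oldest group with the newest sample. */
	for (i = 0; i < NUM_STATS_PER_GROUP; i++)
		__this_cpu_write(cpustat_util[tail][i], util[i]);

	/* Now the tail points at the oldest stored group again. */
	__this_cpu_write(cpustat_tail, (tail + 1) % NUM_STATS_GROUPS);
}

With that, the loop in print_cpustat() starting at cpustat_tail
would print the groups from the oldest one to the newest one.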

Best Regards,
Petr