Re: [PATCH] clocksource: Scale the max retry number of watchdog read according to CPU numbers

From: Paul E. McKenney
Date: Fri Jan 26 2024 - 07:27:29 EST


On Fri, Jan 26, 2024 at 05:12:50PM +0800, Feng Tang wrote:
> There was a bug on one 8-socket server that the TSC is wrongly marked as
> 'unstable' and disabled during boot time. (reproduce rate is every 120
> rounds of reboot tests), with log:
>
> clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns,
> wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable
> tsc: Marking TSC unstable due to clocksource watchdog
> TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152)
> clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896.
> clocksource: Switched to clocksource hpet
>
> The reason is for platform with lots of CPU, there are sporadic big or huge
> read latency of read watchog/clocksource during boot or when system is under
> stress work load, and the frequency and maximum value of the latency goes up
> with the increasing of CPU numbers. Current code already has logic to detect
> and filter such high latency case by reading 3 times of watchdog, and check
> the 2 deltas. Due to the randomness of the latency, there is a low possibility
> situation that the first delta (latency) is big, but the second delta is small
> and looks valid, which can escape from the check, and there is a
> 'max_cswd_read_retries' for retrying that check covering this case, whose
> default value is only 2 and may be not enough for machines with huge number
> of CPUs.
>
> So scale and enlarge the max retry number according to CPU number to better
> filter those latency noise on large system, which has been verified fine in
> 4 days and 670 rounds of reboot test on the 8-socket machine.
>
> Tested-by: Jin Wang <jin1.wang@xxxxxxxxx>
> Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
> ---
> kernel/time/clocksource.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c108ed8a9804..f15283101906 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -226,6 +226,15 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
> u64 wd_end, wd_end2, wd_delta;
> int64_t wd_delay, wd_seq_delay;
>
> + /*
> + * If no user changes the default value, scale the retry threshold
> + * according to CPU numbers. As per test, the more CPU a platform has,
> + * the bigger read latency is found during boot time or under stress
> + * work load. Increase the try nubmer to reduce false alarms.
> + */
> + if (max_cswd_read_retries == 2)

I like the self-adjusting behavior!

But why not make max_cswd_read_retries be a long (instead of a ulong)
defaulting to -1. Then when someone sets it to a specific positive
value, they get exactly that value, while leaving it -1 gets the default
CPU-scaling behavior. Zero and other negative values should get a
warning in order to reserve the for possible future use.

I also suggest doing the adjustment at boot time, for example, using
an early_initcall(). That way the test code also sees the scaled value.

Thanx, Paul

> + max_cswd_read_retries = max(2, ilog2(num_online_cpus()));
> +
> for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
> local_irq_disable();
> *wdnow = watchdog->read(watchdog);
> --
> 2.34.1
>