Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources

From: Waiman Long
Date: Sun Nov 14 2021 - 22:25:41 EST



On 11/14/21 21:08, Feng Tang wrote:
Or did you have something else in mind?
I'm not sure the detail in Waiman's cases, and in our cases (stress-ng)
the delay between watchdog's (HPET here) read were not linear, that
from debug data, sometimes the 3-2 difference could be bigger or much
bigger than the 2-1 difference.

The reason could be the gap between 2 reads depends hugely on the system
pressure at that time that 3 HPET read happens. On our test box (a
2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
about 2.5 us, while under the stress it could be bumped to from 6 us
to 2800 us.

So I think checking the 3-2 difference plus increasing the max retries
to 10 may be a simple way, if the watchdog read is found to be
abnormally long, we skip this round of check.
On one of the test system, I had measured that normal delay
(hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
bootup time. However, the same system under stress could have a delay of
over 200us at bootup time. When I measured the consecutive hpet delay, it
was about 180us. So hpet read did dominate the total clocksource read delay.
Thank you both for the data!

I would not suggest increasing the max retries as it may still fail in most
cases because the system stress will likely not be going away within a short
time. So we are likely just wasting cpu times. I believe we should just skip
it if it is the watchdog read that is causing most of the delay.
If anything, adding that extra read would cause me to -reduce- the number
of retries to avoid increasing the per-watchdog overhead.
I understand Waiman's concern here, and in our test patch, the 2
consecutive watchdog read delay check is done inside this retrying
loop accompanying the 'cs' read, and once an abnormal delay is found,
the watchdog check is skipped without waiting for the max-retries to
complete.

Our test data shows the consecutive delay is not always big even when
the system is much stressed, that's why I suggest to increase the
retries.

If we need a large number of retries to avoid triggering the unstable TSC message, we should consider increase the threshod instead. Right?

That is why my patch 2 makes the max skew value a configurable option so that we can tune it if necessary.

Cheers,
Longman