Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources

From: Paul E. McKenney
Date: Mon Nov 15 2021 - 09:07:28 EST


On Mon, Nov 15, 2021 at 03:59:15PM +0800, Feng Tang wrote:
> On Sun, Nov 14, 2021 at 10:24:56PM -0500, Waiman Long wrote:
> >
> > On 11/14/21 21:08, Feng Tang wrote:
> > > Or did you have something else in mind?
> > > > > > > I'm not sure about the details of Waiman's cases, but in our cases (stress-ng)
> > > > > > > the delays between the watchdog's (HPET here) reads were not linear. From
> > > > > > > the debug data, the 3-2 difference could sometimes be bigger or much
> > > > > > > bigger than the 2-1 difference.
> > > > > >
> > > > > > > The reason could be that the gap between 2 reads depends hugely on the system
> > > > > > > pressure at the time the 3 HPET reads happen. On our test box (a
> > > > > > > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 differences are stably
> > > > > > > about 2.5 us, while under stress they could be bumped to anywhere from 6 us
> > > > > > > to 2800 us.
> > > > > > >
> > > > > > > So I think checking the 3-2 difference plus increasing the max retries
> > > > > > > to 10 may be a simple way: if the watchdog read is found to be
> > > > > > > abnormally long, we skip this round of check.
> > > > > > On one of the test systems, I measured that the normal delay
> > > > > > (hpet->tsc->hpet) was a bit over 2us. It was a bit more than 4us at
> > > > > > bootup time. However, the same system under stress could have a delay of
> > > > > > over 200us at bootup time. When I measured the consecutive hpet delay, it
> > > > > > was about 180us. So the hpet read did dominate the total clocksource read delay.
> > > > Thank you both for the data!
> > > >
> > > > > > I would not suggest increasing the max retries as it may still fail in most
> > > > > > cases because the system stress will likely not be going away within a short
> > > > > > time. So we are likely just wasting CPU time. I believe we should just skip
> > > > > > it if it is the watchdog read that is causing most of the delay.
> > > > If anything, adding that extra read would cause me to -reduce- the number
> > > > of retries to avoid increasing the per-watchdog overhead.
> > > I understand Waiman's concern here, and in our test patch, the 2
> > > consecutive watchdog read delay check is done inside this retrying
> > > loop accompanying the 'cs' read, and once an abnormal delay is found,
> > > the watchdog check is skipped without waiting for the max-retries to
> > > complete.
> > >
> > > Our test data shows the consecutive delay is not always big even when
> > > the system is much stressed, that's why I suggest to increase the
> > > retries.
> >
> > If we need a large number of retries to avoid triggering the unstable TSC
> > message, we should consider increasing the threshold instead. Right?
> >
> > That is why my patch 2 makes the max skew value a configurable option so
> > that we can tune it if necessary.
>
> I'm fine with it, though the ideal case I expected is that, with carefully
> picked values for max_retries/skew_threshold, we could save the users
> from configuring these. But given the complexity of all HWs out there,
> it's not an easy goal.

That is my goal as well, but I expect that more experience, testing,
and patches will be required to reach that goal.

> And I still suggest putting the consecutive watchdog read check inside
> the retry loop, so that it could bail out early when detecting the
> abnormal delay.

If the HPET read shows abnormal delay, agreed. But if the abnormal
delay is only in the clocksource under test (TSC in this case), then
a re-read seems to me to make sense.
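
To make that distinction concrete, here is a rough sketch in standalone
userspace C, not kernel code. The enum, the function name, and the single
max_delay_ns threshold are all hypothetical and chosen only for illustration.
It assumes we already have the delay between two back-to-back watchdog reads
and the delay of the watchdog->cs->watchdog sequence:

```c
#include <stdint.h>

/* Hypothetical outcomes of one watchdog sampling attempt. */
enum wd_sample_result {
	WD_SAMPLE_OK,    /* both delays acceptable: use this sample */
	WD_SAMPLE_RETRY, /* only the cs read was slow: read again */
	WD_SAMPLE_SKIP,  /* the watchdog read itself was slow: skip this round */
};

/*
 * wd_delay_ns:  delay between two consecutive watchdog (e.g. HPET) reads
 * cs_delay_ns:  delay of the watchdog -> cs -> watchdog read sequence
 * max_delay_ns: assumed threshold above which a delay counts as abnormal
 */
enum wd_sample_result classify_sample(uint64_t wd_delay_ns,
				      uint64_t cs_delay_ns,
				      uint64_t max_delay_ns)
{
	/* A slow watchdog read will not be fixed by retrying: bail out early. */
	if (wd_delay_ns > max_delay_ns)
		return WD_SAMPLE_SKIP;

	/* Watchdog read was fine, so the cs read was disturbed: retry. */
	if (cs_delay_ns > max_delay_ns)
		return WD_SAMPLE_RETRY;

	return WD_SAMPLE_OK;
}
```

With an assumed 100us threshold, Waiman's stressed numbers (180us between
consecutive hpet reads, over 200us total) would come out as a skip rather
than a retry.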

> Another thing is we may need to set the 'watchdog_reset_pending', as
> under stress there could be many consecutive "skipped" watchdog
> checks, and the saved values of 'cs' and 'watchdog' should be
> reset.

My thought was to count a read failure only if the HPET read did not
have excessive delays. This means that a cache-buster workload could
indefinitely delay a clock-skew check, which was one reason that I
was thinking in terms of using the actual measured delays to set the
clock-skew check criterion.
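
One possible reading of "using the actual measured delays", again as a
standalone userspace sketch and not a proposed patch: widen the allowed skew
by whatever read delay was actually observed, so that a slow read loosens the
check instead of postponing it. The function names and the base_margin_ns
parameter are hypothetical, and simple addition is only an assumption about
how the two would be combined:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed rule: allowed skew = fixed base margin + measured read delay. */
uint64_t skew_limit_ns(uint64_t base_margin_ns, uint64_t read_delay_ns)
{
	return base_margin_ns + read_delay_ns;
}

/*
 * cs_interval_ns: elapsed time according to the clocksource under test
 * wd_interval_ns: elapsed time according to the watchdog
 * Returns true if the two disagree by more than the widened limit.
 */
bool skew_exceeded(int64_t cs_interval_ns, int64_t wd_interval_ns,
		   uint64_t base_margin_ns, uint64_t read_delay_ns)
{
	int64_t diff = cs_interval_ns - wd_interval_ns;
	/* Absolute value computed in unsigned arithmetic to avoid overflow. */
	uint64_t abs_diff = diff < 0 ? 0 - (uint64_t)diff : (uint64_t)diff;

	return abs_diff > skew_limit_ns(base_margin_ns, read_delay_ns);
}
```

For example, with an assumed 100us base margin, a 300us disagreement would be
flagged when the measured read delay is 2.5us, but not when the read delay
was 2800us as in Feng's stressed case.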

Either way, something like Waiman's patch checking the HPET delay looks
to me to be valuable.

Thoughts?

Thanx, Paul