Re: [LKP] Re: [clocksource] 6c52b5f3cf: stress-ng.opcode.ops_per_sec -14.4% regression

From: Xing Zhengjun
Date: Thu Apr 22 2021 - 22:16:04 EST

Next message: HORIGUCHI NAOYA(堀口　直也): "Re: [PATCH v3 2/3] mm,hwpoison: return -EHWPOISON when page already"
Previous message: Siddharth Gupta: "Re: [PATCH] remoteproc: core: Move cdev add before device add"
In reply to: Paul E. McKenney: "Re: [LKP] Re: [clocksource] 6c52b5f3cf: stress-ng.opcode.ops_per_sec -14.4% regression"
Next in thread: Paul E. McKenney: "Re: [LKP] Re: [clocksource] 6c52b5f3cf: stress-ng.opcode.ops_per_sec -14.4% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 4/22/2021 10:24 PM, Paul E. McKenney wrote:

On Thu, Apr 22, 2021 at 03:41:26PM +0800, Feng Tang wrote:

Hi Paul,

On Thu, Apr 22, 2021 at 02:58:27PM +0800, Xing Zhengjun wrote:

On 4/21/2021 9:42 PM, Paul E. McKenney wrote:

On Wed, Apr 21, 2021 at 02:07:19PM +0800, Xing, Zhengjun wrote:

On 4/20/2021 10:05 PM, Paul E. McKenney wrote:

On Tue, Apr 20, 2021 at 06:43:31AM -0700, Paul E. McKenney wrote:

On Tue, Apr 20, 2021 at 02:49:34PM +0800, kernel test robot wrote:

Greeting,

FYI, we noticed a -14.4% regression of stress-ng.opcode.ops_per_sec due to commit:

commit: 6c52b5f3cfefd6e429efc4413fd25e3c394e959f ("clocksource: Reduce WATCHDOG_THRESHOLD")
https://git.kernel.org/cgit/linux/kernel/git/paulmck/linux-rcu.git dev.2021.04.13a

in testcase: stress-ng
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
with following parameters:

nr_threads: 10%
disk: 1HDD
testtime: 60s
fs: ext4
class: os
test: opcode
cpufreq_governor: performance
ucode: 0x5003006

Hmmm... I will try a less-aggressive reduction. Thank you for testing!

But wait... This code is only running twice per second. It is very
hard to believe that a clock-read retry twice per second is worth 2% of
performance, let alone 14.4%.

Is something else perhaps going on here?

For example, did this run enable any of the new diagnositic clocksource.*
kernel parameters?

Thanx, Paul

I attached the kernel log, the following logs are related with the
clocksource.
[    3.453206] clocksource: timekeeping watchdog on CPU1: Marking
clocksource 'tsc-early' as unstable because the skew is too large:
[    3.455197] clocksource:                       'hpet' wd_now: 288fcc0
wd_last: 1a8b333 mask: ffffffff
[    3.455199] clocksource:                       'tsc-early' cs_now:
1def309ebfdee cs_last: 1def2bd70d92c mask: ffffffffffffffff
[    3.455201] clocksource:                       No current clocksource.
[    3.457197] tsc: Marking TSC unstable due to clocksource watchdog

6c52b5f3cf reduced WATCHDOG_THRESHOLD, then in clocksource_watchdog, the
warning logs are print, the TSC is marked as unstable.
/* Check the deviation from the watchdog clocksource. */

Aha, so this system really does have an unstable TSC! Which means that
the patch is operating as designed.

Or are you saying that this is a false positive?

Thanx, Paul

It happened during boot and before TSC calibration
(tsc_refine_calibration_work()), so on some machines "abs(cs_nsec - wd_nsec)

WATCHDOG_THRESHOLD", WATCHDOG_THRESHOLD is set too small at that time.

After TSC calibrated, abs(cs_nsec - wd_nsec) should be very small,
WATCHDOG_THRESHOLD for here is ok. So I suggest increasing the
WATCHDOG_THRESHOLD before TSC calibration, for example, the clocks be skewed
by more than 1% to be marked unstable.

This is common code, so we do need an architecture-independent way to
handle this.

As Zhengjun measuered, this is a Cascade Lake platform, and it has 2
times calibration of tsc, the first one of early quick calibration gives
2100 MHz, while the later accurate calibration gives 2095 MHz, so there
is about 2.5/1000 deviation for the first number, which just exceeds the
1/1000 threshold you set :)

Even my 2/1000 initial try would have caused this, then. ;-)

But even 1/1000 deviation would cause any number of applications some
severe heartburn, so I am not at all happy with the thought of globally
increasing to (say) 3/1000.

Following is the tsc freq info from kernel log

[ 0.000000] DMI: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[ 0.000000] tsc: Detected 2100.000 MHz processor
...
[ 13.859982] tsc: Refined TSC clocksource calibration: 2095.077 MHz

So what are our options?

1. Clear CLOCK_SOURCE_MUST_VERIFY from tsc-early.

2. #1, but add tsc-early into the watchdog list and set
CLOCK_SOURCE_MUST_VERIFY once it is better calibrated.

3. Add a field to struct clocksource that, if non-zero, gives
the maximum drift in nanoseconds per half second (AKA
WATCHDOG_INTERVAL). If zero, the WATCHDOG_MAX_SKEW value
is used. Set this to (say) 150,000ns for tsc-early.

4. As noted earlier, increase WATCHDOG_MAX_SKEW to 150 microseconds,
which again is not a good approach given the real-world needs
of real-world applications.

5. Your ideas here.

How about set two watchdog thresholds, one for before calibration(1/100), the other for after calibration(1/1000)? For example, u64 watchdog_thresholds[2].

All in all, I am glad that I made the patch that decreases
WATCHDOG_MAX_SKEW be separate and at the end of the series. ;-)

Thanx, Paul

--
Zhengjun Xing

Next message: HORIGUCHI NAOYA(堀口　直也): "Re: [PATCH v3 2/3] mm,hwpoison: return -EHWPOISON when page already"
Previous message: Siddharth Gupta: "Re: [PATCH] remoteproc: core: Move cdev add before device add"
In reply to: Paul E. McKenney: "Re: [LKP] Re: [clocksource] 6c52b5f3cf: stress-ng.opcode.ops_per_sec -14.4% regression"
Next in thread: Paul E. McKenney: "Re: [LKP] Re: [clocksource] 6c52b5f3cf: stress-ng.opcode.ops_per_sec -14.4% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]