Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.

From: Atish Patra
Date: Wed Nov 22 2017 - 00:25:32 EST


Here are the results of schbench (a scheduler latency benchmark) and uperf (a networking benchmark).

Hardware config: 20-core (40 hyperthreaded CPUs) x86 box.
schbench config: message threads = 2; time = 180s; worker threads = variable
uperf config: ping-pong test on the loopback interface with message size = 8k
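(For the schbench runs, this maps to an invocation along the lines of "schbench -m 2 -t <workers> -r 180", where <workers> is varied per the tables below; the exact command line is not reproduced here and flag spellings may differ slightly between schbench versions.)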

Overall, both benchmarks seem to be happiest when the number of threads is close to the number of CPUs.
--------------------------------------------------------------------------------------------------------------------------
schbench Maximum Latency (lower is better):

Num              Base (4.14)             Base+pcpu
Workers          Mean      Stdev         Mean      Stdev        Improvement (%)
10 3026.8 4987.12 523 474.35 82.7210255055
18 13854.6 1841.61 12945.6 125.19 6.5609977913
19 16457 2046.51 12985.4 48.46 21.0949747828
20 14995 2368.84 15838 2038.82 -5.621873958
25 29952.2 107.72 29673.6 337.57 0.9301487036
30 30084 19.768 30096.2 7.782 -0.0405531179

-------------------------------------------------------------------------------------------------------------------
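(The Improvement column in the latency tables appears to be computed as (base_mean - pcpu_mean) / base_mean * 100; e.g. for 10 workers, (3026.8 - 523) / 3026.8 * 100 ~= 82.72%. For the uperf table further down, where higher appears to be better, the sign convention is flipped: (pcpu_mean - base_mean) / base_mean * 100.)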

The proposed fix seems to improve the maximum latency for lower numbers of threads. It also seems to reduce the run-to-run variation (lower stdev).

If the number of threads is equal to or higher than the number of CPUs, the benchmark produces significantly
higher latencies because of its nature. Results for the higher-thread-count cases are presented to provide
a complete picture, but it is difficult to conclude anything from them.

Next, individual percentile results are presented for each use case. The proposed fix also
improves latency across all percentiles for the configuration (19 worker threads) that
should saturate the system.
---------------------------------------------------------------------------------------------------------------------
schbench Latency in usec (lower is better)

Num              Baseline (4.14)         Base+pcpu
Workers          Mean      Stdev         Mean      Stdev        Improvement (%)
50th
10 64.2 2.039 63.6 1.743 0.934
18 57.6 5.388 57 4.939 1.041
19 63 4.774 58 4 7.936
20 59.6 4.127 60.2 5.153 -1.006
25 78.4 0.489 78.2 0.748 0.255
30 96.2 0.748 96.4 1.019 -0.207

75th
10 72 3.033 71.6 2.939 0.555
18 78 2.097 77.2 2.135 1.025
19 81.6 1.2 79.4 0.8 2.696
20 81 1.264 80.4 2.332 0.740
25 109.6 1.019 110 0 -0.364
30 781.4 50.902 731.8 70.6382 6.3475

90th
10 80.4 3.666 80.6 2.576 -0.248
18 87.8 1.469 88 1.673 -0.227
19 92.8 0.979 90.6 0.489 2.370
20 92.6 1.019 92 2 0.647
25 8977.6 1277.160 9014.4 467.857 -0.409
30 9558.4 334.641 9507.2 320.383 0.5356

95th
10 86.8 3.867 87.6 4.409 -0.921
18 95.4 1.496 95.2 2.039 0.209
19 102.6 1.624 99 0.894 3.508
20 103.2 1.326 102.2 2.481 0.968
25 12400 78.383 12406.4 37.318 -0.051
30 12336 40.477 12310.4 12.8 0.207

99th
10 99.2 5.418 103.4 6.887 -4.233
18 115.2 2.561 114.6 3.611 0.5208
19 126.25 4.573 120.4 3.872 4.6336
20 145.4 3.09 133 1.41 8.5281
25 12988.8 15.676 12924.8 25.6 0.4927
30 12988.8 15.676 12956.8 32.633 0.2463

99.50th
10 104.4 5.161 109.8 7.909 -5.172
18 127.6 7.391 124.2 4.214 2.6645
19 2712.2 4772.883 133.6 5.571 95.074
20 3707.8 2831.954 2844.2 4708.345 23.291
25 14032 1283.834 13008 0 7.2976
30 16550.4 886.382 13840 1218.355 16.376
------------------------------------------------------------------------------------------------------------------------

Results from uperf
uperf config: Loopback ping pong test with message size = 8k

                 Baseline (4.14)         Baseline+pcpu
Threads          Mean      Stdev         Mean      Stdev        Improvement (%)
1 9.056 0.02 8.966 0.083 -0.993
2 17.664 0.13 17.448 0.303 -1.222
4 32.03 0.22 31.972 0.129 -0.181
8 58.198 0.31 58.588 0.198 0.670
16 101.018 0.67 100.056 0.455 -0.952
32 148.1 15.41 164.494 2.312 11.069
64 203.66 1.16 203.042 1.348 -0.3073
128 197.12 1.04 194.722 1.174 -1.2165

The race-window fix seems to help uperf at 32 threads (closest to the number of CPUs) as well.
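
For anyone skimming this thread without the patch in front of them, the idea being measured is roughly: mark a cpu as "claimed" the moment the wakeup path hands it out, and skip cpus that are already claimed. A minimal, untested sketch of that idea (with made-up names, not the actual patch) would look something like:

#include <linux/percpu.h>

/* Hypothetical sketch only -- names and placement are illustrative. */
static DEFINE_PER_CPU(bool, wakeup_claimed);

/* Would be called from the idle-cpu selection path once a candidate is found. */
static bool try_claim_idle_cpu(int cpu)
{
        if (per_cpu(wakeup_claimed, cpu))
                return false;   /* another wakeup already picked this cpu */

        /*
         * Plain, non-atomic update: two concurrent wakeups can still both
         * read the flag as false and pick the same cpu, which is why the
         * patch only minimizes the race window rather than closing it.
         */
        per_cpu(wakeup_claimed, cpu) = true;
        return true;
}

/*
 * The flag would be cleared again once the wakee is actually enqueued on
 * that cpu, so later wakeups can pick it once more.
 */

The claim is deliberately a plain per-cpu write, so the race is narrowed to the flag access rather than fully closed, matching the note in the patch description quoted below.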

Regards,
Atish

On 11/04/2017 07:58 PM, Joel Fernandes wrote:
> Hi Peter,
>
> On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
>>> Currently, multiple tasks can wake up on the same cpu from the
>>> select_idle_sibling() path if they wake up simultaneously
>>> and last ran on the same llc. This happens because an idle cpu
>>> is not updated until the idle task is scheduled out. Any task waking
>>> during that period may potentially select that cpu as a wakeup
>>> candidate.
>>>
>>> Introduce a per-cpu variable that is set as soon as a cpu is
>>> selected for wakeup for any task. This prevents other tasks
>>> from selecting the same cpu again. Note: this does not close the race
>>> window, but it minimizes it to accessing the per-cpu variable. If two
>>> wakee tasks access the per-cpu variable at the same time, they may
>>> still select the same cpu. But it narrows the race window
>>> considerably.
>>
>> The very most important question: does it actually help? What
>> benchmarks give what numbers?
> I collected some numbers with an Android benchmark called Jankbench.
> Most tests didn't show an improvement or a degradation with the patch.
> However, one of the tests, called "list view", consistently shows an
> improvement. Particularly striking is the improvement at the mean and
> the 25th percentile.
>
> For the list_view test, Jankbench pulls up a list of text and scrolls the
> list; this exercises the display pipeline in Android to render and
> display the animation as the scroll happens. For Android, lower frame
> times are considered quite important, as that means we are less likely
> to drop frames and more likely to give the user a good experience rather
> than a perceivably poor one.
>
> For each frame, Jankbench measures the total time the frame takes and
> stores it in a DB (the time from when the app starts drawing to when
> the rendering completes and the frame is submitted for display).
> Following is the distribution of frame times in ms.
>
> count 16304 (@60 fps, 4.5 minutes)
>
>          Without patch    With patch
> mean     5.196633         4.429641 (+14.75%)
> std      2.030054         2.310025
> 25%      5.606810         1.991017 (+64.48%)
> 50%      5.824013         5.716631 (+1.84%)
> 75%      5.987102         5.932751 (+0.90%)
> 95%      6.461230         6.301318 (+2.47%)
> 99%      9.828959         9.697076 (+1.34%)
>
> Note that although Android uses energy-aware scheduling patches, I
> turned those off to bring the test as close to mainline as possible. I
> also backported Vincent's and Brendan's slow path fixes to the 4.4
> kernel that the Pixel 2 uses.
>
> Personally I am in favor of this patch considering this test data, but
> also because, in the past, our teams had to deal with the
> same race issue and used cpusets to avoid it (although they probably
> tested with "energy aware" CPU selection kept on).
>
> thanks,
>
> - Joel