[PATCH v5 0/4] Utilization estimation (util_est) for FAIR tasks

From: Patrick Bellasi
Date: Thu Feb 22 2018 - 12:02:21 EST


Hi, here is an update of [1], based on today's tip/sched/core [2], which
targets two main things:

1) Add the required/missing {READ,WRITE}_ONCE compiler barriers

AFAIU, we wanted those barriers for lock-less synchronization between:
- enqueue/dequeue calls: these are serialized by the RQ lock, but the
util_est signals they read/modify can be _read concurrently_ from other
code paths.
- load balancer related functions: these are not serialized by the RQ lock
but read the util_est signals updated by the enqueue/dequeue calls.

However, after noticing this commit:

7bd3e239d6c6 locking: Remove atomicy checks from {READ,WRITE}_ONCE

I'm still a bit confused about the real need for these calls, mainly because:
a) they are not the proper mechanism to guarantee atomic loads/stores, for
example when we need to access u64 values while running on a 32-bit
target.
b) apart from possible load/store tearing issues, I was not able to see
other scenarios, among the ones described in [3], which potentially
apply to the code of these patches.

Thus, to avoid load/store tearing, in principle I would have used:
- WRITE_ONCE only in RQ-lock serialized code, where we read/modify the
util_est signals, in order to properly publish them to concurrently
running load balancer code.
- READ_ONCE only in _non_ RQ-lock serialized code, where we only read the
util_est signals.

To my understanding this should be good enough both to document the
concurrent access to these shared variables and to still allow the compiler
to optimize some loads in the RQ-lock serialized code.

All that considered, my last question on this point is: can we remove the
READ_ONCE()s from RQ-lock serialized code?
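
Just to make the intended convention concrete, here is a minimal sketch of
the two sides; the helpers are suffixed with _sketch and their bodies are
simplified, so this is not the actual patch code. The writer side uses a
plain read, which is exactly what the question above boils down to:

        /* Writer side: enqueue/dequeue paths, serialized by the RQ lock. */
        static inline void util_est_enqueue_sketch(struct cfs_rq *cfs_rq,
                                                   unsigned int task_est)
        {
                unsigned int enqueued;

                /* Plain read: updates of this field are serialized by the RQ lock. */
                enqueued  = cfs_rq->avg.util_est.enqueued;
                enqueued += task_est;

                /* Publish the new value to lock-less readers, avoiding store tearing. */
                WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
        }

        /* Reader side: load balancer paths, not serialized by the RQ lock. */
        static inline unsigned int cpu_util_est_sketch(struct cfs_rq *cfs_rq)
        {
                /* Lock-less read of the published value, avoiding load tearing. */
                return READ_ONCE(cfs_rq->avg.util_est.enqueued);
        }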

2) Ensure the feature can be safely turned on by default

Estimated utilization of tasks and RQs is a feature which mainly benefits
lightly utilized systems, where you can have tasks which sleep for a
relatively long time but for which we still want to be fast at ramping up
the OPPs once they wake up.
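
The basic idea, which the third patch plugs into schedutil via
cpu_util_cfs(), is for OPP selection to consider the maximum of the
(possibly decayed) PELT utilization and the estimated utilization.
Roughly along these lines; this is my simplified sketch of the idea, not
the actual patch code:

        /* Simplified sketch of the idea, not the actual cpu_util_cfs() code. */
        static inline unsigned long cpu_util_cfs_sketch(struct rq *rq)
        {
                unsigned long util = rq->cfs.avg.util_avg;

                /*
                 * While a task sleeps its util_avg decays, but its estimated
                 * utilization is added back to util_est.enqueued as soon as
                 * the task is enqueued again: taking the max allows OPPs to
                 * ramp up quickly at wakeup time.
                 */
                if (sched_feat(UTIL_EST)) {
                        util = max_t(unsigned long, util,
                                     READ_ONCE(rq->cfs.avg.util_est.enqueued));
                }

                return util;
        }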

However, since Peter proposed to have this scheduler feature turned on by
default, we spent a bit more time focusing on hackbench to verify it will
not hurt server/HPC classes of workloads.

The main discovery has been that, if we properly configure hackbench to
have a high rate of enqueue/dequeue events, then despite the few
instructions util_est adds, its overhead starts to become more noticeable.
That's the reason for the last patch we added to this series, whose
changelog should be good enough to describe the issue and the proposed
solution.
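
In short, the idea of that last patch is to skip the util_est update when
the underlying util_avg signal has not changed since the last update. A
heavily simplified sketch follows; util_avg_has_changed() is a made-up
placeholder for this example and is not the mechanism actually used by the
patch:

        /* Heavily simplified sketch; util_avg_has_changed() is a made-up helper. */
        static inline void util_est_dequeue_sketch(struct cfs_rq *cfs_rq,
                                                   struct task_struct *p,
                                                   bool task_sleep)
        {
                if (!sched_feat(UTIL_EST))
                        return;

                /* ... remove the task's contribution from cfs_rq's util_est ... */

                if (!task_sleep)
                        return;

                /*
                 * If util_avg has not been updated since the last util_est
                 * update, the new sample carries no new information: skip the
                 * EWMA update and keep its cost out of the hot dequeue path.
                 */
                if (!util_avg_has_changed(&p->se))
                        return;

                /* ... otherwise update the task's util_est EWMA ... */
        }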

Experiments including this patch have been run on a dual socket
Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, using precisely this
configuration:

- cpusets to isolate a single socket for the execution of hackbench on
just 10 of the available 20 cores.
This allows us to avoid NUMA load balancer side effects, which we noticed
affect the variance across multiple experiments quite a lot.

- CPUFreq powersave policy, with the intel_pstate driver configured in
passive mode and scaling_max_freq set equal to scaling_min_freq.
This allows us to rule out thermal and/or turbo boost side effects.

- hackbench configured to run 120 iterations of:

perf bench sched messaging --pipe --thread --group 8 --loop 5000

which, in the above setup, corresponds to ~11s completion time for each
iteration using 320 tasks.

In the above setup, this configuration seems to maximize the rate of
wakeup/sleep events, thus better stressing the enqueue/dequeue code paths.
Here are the stats we collected for the completion times:

        count  mean       std       min     50%     95%      99%      max
before  120.0  11.010342  0.375753  10.104  11.046  11.54155 11.69629 11.751
after   120.0  11.041117  0.375429  10.015  11.072  11.59070 11.67720 11.692

after vs before: +0.3% on the mean
after vs before: -0.2% on the 99th percentile

Results on ARM (Android) devices have been collected and reported in a previous
posting [4] and they showed negligible overhead compared to the corresponding
power/performance benefits.

Changes in v5:
- rebased on today's tip/sched/core (commit 083c6eeab2cc, based on v4.16-rc2)
- update util_est only on util_avg updates
- add documentation for "struct util_est"
- always use int instead of long whenever possible (Peter)
- pass cfs_rq to util_est_{en,de}queue (Peter)
- pass task_sleep to util_est_dequeue
- use a single WRITE_ONCE at dequeue time
- add some missing {READ,WRITE}_ONCE
- add task_util_est() for code consistency

Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- renamed util_est's "last" into "enqueued"
- using util_est's "enqueued" for both se and cfs_rqs (Joel)
- update margin check to use more ASM friendly code (Peter)
- optimize EWMA updates (Peter)
- ensure cpu_util_wake() is clamped by cpu_capacity_orig() (Pavan)
- simplify cpu_util_cfs() integration (Dietmar)

Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)
- moved util_est into sched_avg (Peter)
- use {READ,WRITE}_ONCE() for EWMA updates (Peter)
- using unsigned int to fit all sched_avg into a single 64B cache line
- schedutil integration using Juri's cpu_util_cfs()
- first patch dropped since it's already queued in tip/sched/core

Changes in v2:
- rebased on top of v4.15-rc2
- tested that the overhauled PELT code does not affect util_est

Cheers,
Patrick

.:: References
==============
[1] https://lkml.org/lkml/2018/2/6/356
20180206144131.31233-1-patrick.bellasi@xxxxxxx
[2] git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
sched/core (commit 083c6eeab2cc)
[3] Documentation/memory-barriers.txt (Line 1508)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/memory-barriers.txt?h=v4.16-rc2#n1508
[4] https://lkml.org/lkml/2018/1/23/645
20180123180847.4477-1-patrick.bellasi@xxxxxxx

Patrick Bellasi (4):
sched/fair: add util_est on top of PELT
sched/fair: use util_est in LB and WU paths
sched/cpufreq_schedutil: use util_est for OPP selection
sched/fair: update util_est only on util_avg updates

include/linux/sched.h | 29 +++++++
kernel/sched/debug.c | 4 +
kernel/sched/fair.c | 221 ++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 5 ++
kernel/sched/sched.h | 7 +-
5 files changed, 259 insertions(+), 7 deletions(-)

--
2.15.1