[PATCH 0/7] introduce cpu.headroom knob to cpu controller

From: Song Liu
Date: Mon Apr 08 2019 - 17:46:01 EST

Next message: Song Liu: "[PATCH 1/7] sched: refactor tg_set_cfs_bandwidth()"
Previous message: Carlos O'Donell: "Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)"
Next in thread: Song Liu: "[PATCH 1/7] sched: refactor tg_set_cfs_bandwidth()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Servers running latency-sensitive workloads usually aren't fully loaded,
for various reasons including disaster readiness. The machines running our
interactive workloads (referred to as the main workload) have a lot of
spare CPU cycles that we would like to use for opportunistic side jobs like
video encoding. However, our experiments show that the side workload has a
strong impact on the latency of the main workload:

  side-job   main-load-level   main-avg-latency
  none       1.0               1.00
  none       1.1               1.10
  none       1.2               1.10
  none       1.3               1.10
  none       1.4               1.15
  none       1.5               1.24
  none       1.6               1.74

  ffmpeg     1.0               1.82
  ffmpeg     1.1               2.74

Note: both the main-load-level and the main-avg-latency numbers are
_normalized_.

In these experiments, ffmpeg is put in a cgroup with cpu.weight of 1
(lowest priority). However, it consumes all idle CPU cycles in the
system and causes high latency for the main workload. Further experiments
and analysis (more details below) show that, for the main workload to meet
its latency targets, it is necessary to limit the CPU usage of the side
workload so that some CPU time stays _idle_. There are two main reasons
behind the need for idle CPU time. First, saturation of shared CPU
resources starts to happen well before time-measured utilization reaches
100%. Second, scheduling latency starts to impact the main workload as the
CPU approaches full utilization.

Currently, the cpu controller provides two mechanisms to protect the main
workload: cpu.weight and cpu.max. However, neither of them is sufficient
in these use cases. As shown in the experiments above, a side workload with
cpu.weight of 1 (lowest priority) would still consume all idle CPU and add
unacceptable latency to the main workload. cpu.max can throttle the CPU
usage of the side workload and preserve some idle CPU. However, cpu.max
cannot react to changes in load levels. For example, when the main
workload uses 40% of CPU, cpu.max of 30% for the side workload would yield
good latencies for the main workload. However, when the main workload
experiences higher load levels and uses more CPU, the same setting (cpu.max
of 30%) would cause the main workload to miss its latency target.
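
The arithmetic behind the cpu.max example can be sketched as follows. This
is only an illustrative model, not kernel code: the function name is
hypothetical, all values are percentages of total machine CPU, and it
assumes the side workload always wants more CPU than it is given and that
the main workload gets its demand first.

```python
def idle_with_cpu_max(main_load_pct, cpu_max_pct, side_demand_pct=100.0):
    """Idle CPU (percent) left over under a static cpu.max limit.

    Hypothetical model: the side workload takes whatever is smallest of
    its own demand, its cpu.max limit, and the CPU the main workload
    leaves unused.
    """
    side = min(side_demand_pct, cpu_max_pct, 100.0 - main_load_pct)
    return 100.0 - main_load_pct - side


# A fixed cpu.max of 30% leaves less and less idle CPU as main load grows.
for main_load in (40, 50, 60, 70, 80):
    idle = idle_with_cpu_max(main_load, 30)
    print(f"main={main_load}% cpu.max=30% idle={idle:g}%")
```

With the main workload at 40%, the model leaves 30% idle; at 60% it leaves
only 10%, which is the situation described above where the same cpu.max
setting no longer protects the main workload.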

These experiments demonstrate the need for a mechanism that effectively
throttles the CPU usage of the side workload and preserves idle CPU cycles.
The mechanism should be able to adjust the level of throttling based on
the load level of the main workload.

This patchset introduces a new knob for the cpu controller: cpu.headroom.
The cgroup of the main workload uses cpu.headroom to limit the CPU cycles
available to the side workload. For example, if the main workload has a
cpu.headroom of 30%, the side workload will be throttled so that 30% of
overall CPU stays idle. If the main workload uses more than 70% of CPU, the
side workload will only run with a configurable minimal number of cycles.
This configurable minimum is referred to as the "tolerance" of the main
workload.

The following is a detailed example:

  main/cpu.headroom   main-cpu-load   low-pri-cpu-cycle   idle-cpu
  30%                 30%             40%                 30%
  30%                 40%             30%                 30%
  30%                 50%             20%                 30%
  30%                 60%             10%                 30%
  30%                 70%             minimal             ~30%
  30%                 80%             minimal             ~20%

In the example, we use a constant cpu.headroom setting of 30%. As the main
job experiences different levels of load, the cpu controller adjusts the
CPU cycles used by the low-pri jobs.
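
The rows of the example table can be reproduced with a small model. This
is a sketch of the intended behavior as described in this cover letter,
not the implementation in the patches: the function name is hypothetical,
the 1% tolerance is an assumed value for illustration, and the side
workload is assumed to use its whole budget.

```python
def side_budget(headroom_pct, main_load_pct, tolerance_pct=1.0):
    """Return (side_pct, idle_pct) as percentages of total CPU.

    Hypothetical model: the side workload gets whatever remains after
    the main load and the headroom, but never less than the tolerance.
    """
    spare = 100.0 - headroom_pct - main_load_pct
    side = max(spare, tolerance_pct)
    idle = 100.0 - main_load_pct - side
    return side, idle


# Constant headroom of 30%, varying main load (as in the table above).
for main_load in (30, 40, 50, 60, 70, 80):
    side, idle = side_budget(30, main_load)
    print(f"main={main_load}% side={side:g}% idle={idle:g}%")
```

Once the main load reaches 70%, the side budget in this model drops to the
tolerance, and idle CPU ends up slightly below 30% and 20%, matching the
"minimal" and "~" entries in the table.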

We experimented with a web server as the main workload and ffmpeg as the
side workload. The following table compares the latency impact on the main
workload under different cpu.headroom settings and load levels. In all
tests, the side workload cgroup is configured with cpu.weight of 1. When
throttled, the side workload can only run for 1ms per 100ms period.

                                 average-latency
  main-load-level   w/o-side   w/-side-      w/-side-       w/-side-
                               no-headroom   30%-headroom   20%-headroom
  1.0               1.00       1.82          1.26           1.14
  1.1               1.10       2.74          1.26           1.32
  1.2               1.10                     1.29           1.38
  1.3               1.10                     1.32           1.49
  1.4               1.15                     1.29           1.85
  1.5               1.24                     1.32
  1.6               1.74                     1.50

Each row of the table shows a normalized load level and the average
latencies for 4 scenarios: w/o side workload; w/ side workload but no
headroom; w/ side workload and 30% headroom; w/ side workload and 20%
headroom.

When there is no side workload, the average latency of the main job stays
in the 1.0x to 1.2x range, except in the very high load scenario (1.74 at
load level 1.6). When there is a side workload but no headroom, the latency
of the main job goes very high even at moderate load levels (1.82 to 2.74).
With 30% headroom, the average latency stays around 1.3x (1.26 to 1.32) up
to load level 1.5. With 20% headroom, the average latency ranges from 1.14
to 1.85. We didn't finish the tests in some high-load cases because the
latency was too high.
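
To put the table above in relative terms, the following sketch computes
the latency increase of each scenario over the no-side-workload baseline
at the same load level. The numbers are copied from the table; assigning
the sparse rows to the 30% and 20% headroom columns is our reading of the
table layout, and the variable and function names are illustrative only.

```python
# Normalized average latency per main-load-level, copied from the table.
baseline = {1.0: 1.00, 1.1: 1.10, 1.2: 1.10, 1.3: 1.10,
            1.4: 1.15, 1.5: 1.24, 1.6: 1.74}
no_headroom = {1.0: 1.82, 1.1: 2.74}
headroom_30 = {1.0: 1.26, 1.1: 1.26, 1.2: 1.29, 1.3: 1.32,
               1.4: 1.29, 1.5: 1.32, 1.6: 1.50}
headroom_20 = {1.0: 1.14, 1.1: 1.32, 1.2: 1.38, 1.3: 1.49, 1.4: 1.85}


def overhead_pct(scenario):
    """Latency increase over the baseline, in percent, per load level.

    Load levels without a measurement in the scenario are skipped.
    """
    return {load: round((lat / baseline[load] - 1.0) * 100)
            for load, lat in scenario.items()}


for name, data in (("no-headroom", no_headroom),
                   ("30%-headroom", headroom_30),
                   ("20%-headroom", headroom_20)):
    print(name, overhead_pct(data))
```

At load level 1.0, for example, this gives an 82% increase with no
headroom and a 26% increase with 30% headroom.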

This experiment demonstrates that cpu.headroom is an effective and
efficient knob to control the latency of the main job.

Thanks!

Song Liu (7):
  sched: refactor tg_set_cfs_bandwidth()
  cgroup: introduce hook css_has_tasks_changed
  cgroup: introduce cgroup_parse_percentage
  sched, cgroup: add entry cpu.headroom
  sched/fair: global idleness counter for cpu.headroom
  sched/fair: throttle task runtime based on cpu.headroom
  Documentation: cgroup-v2: add information for cpu.headroom

 Documentation/admin-guide/cgroup-v2.rst |  18 +
 fs/proc/stat.c                          |   4 +-
 include/linux/cgroup-defs.h             |   2 +
 include/linux/cgroup.h                  |   1 +
 include/linux/kernel_stat.h             |   2 +
 kernel/cgroup/cgroup.c                  |  51 +++
 kernel/sched/core.c                     | 425 ++++++++++++++++++++++--
 kernel/sched/fair.c                     | 143 +++++++-
 kernel/sched/sched.h                    |  30 ++
 9 files changed, 634 insertions(+), 42 deletions(-)

--
2.17.1
