[PATCH V2 0/3] Introduce Thermal Pressure

From: Thara Gopinath
Date: Tue Apr 16 2019 - 15:38:47 EST


Thermal governors can respond to an overheat event on a cpu by
capping the cpu's maximum possible frequency. This in turn
restricts the maximum available compute capacity of the
cpu. But today in the kernel, the task scheduler is
not notified when the maximum frequency of a cpu is capped.
In other words, the scheduler is unaware of maximum capacity
restrictions placed on a cpu due to thermal activity.
This patch series attempts to address this issue.
The benefit identified is better task placement among the available
cpus in the event of overheating, which in turn leads to better
performance numbers.

The reduction in the maximum possible capacity of a cpu due to a
thermal event can be considered as thermal pressure. Instantaneous
thermal pressure is hard to record and can sometimes be erroneous,
as there can be a mismatch between the actual capping of capacity
and the scheduler recording it. The solution is thus to maintain a
weighted average per-cpu value for thermal pressure over time.
The weight reflects the amount of time the cpu has spent at a
capped maximum frequency. Since thermal pressure is recorded as
an average, it must be decayed periodically. To this end, this
patch series defines a configurable decay period.
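The time-weighted averaging described above can be sketched as
follows. This is a minimal user-space illustration, not the series'
actual implementation; the struct and function names
(thermal_pressure, thermal_pressure_update) are invented for this
example, and the weighting scheme (old average weighted by the decay
period, new sample weighted by elapsed time) is one plausible reading
of the description:

```c
#include <stdint.h>

/* Illustrative only: not the kernel's actual data structures. */
struct thermal_pressure {
	uint64_t avg;		/* weighted-average capacity loss */
	uint64_t last_update;	/* timestamp of last sample, in ms */
};

/*
 * Fold a new capacity-delta sample into the running average.
 * 'delta' is (max capacity - capped capacity); 'now' is in ms.
 * The longer the cpu has sat at the capped frequency since the
 * last update, the more weight the new sample carries.
 */
static uint64_t thermal_pressure_update(struct thermal_pressure *tp,
					uint64_t delta, uint64_t now,
					uint64_t decay_period_ms)
{
	uint64_t elapsed = now - tp->last_update;

	/* weighted average of old value and new sample */
	tp->avg = (tp->avg * decay_period_ms + delta * elapsed) /
		  (decay_period_ms + elapsed);
	tp->last_update = now;
	return tp->avg;
}
```

With this shape, a cpu that runs uncapped (delta 0) sees its recorded
pressure shrink toward zero over successive updates, which is the
periodic decay the cover letter refers to.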

Regarding testing, basic build, boot and sanity testing have been
performed on a hikey960 running a mainline kernel with a Debian file
system. Further, aobench (an occlusion renderer for benchmarking
real-world floating point performance), dhrystone and hackbench have
been run with the thermal pressure algorithm. During testing, due to
constraints of the step-wise governor in dealing with big.LITTLE
systems, cpu cooling was disabled on the little cores, the idea being
that the big cores will heat up, the cpu cooling device will throttle
the frequency of the big cores, thereby limiting the maximum available
capacity, and the scheduler will spread out tasks to the little cores
as well. Finally, this patch series has been boot tested on a db410c
running a v5.1-rc4 kernel.

During the course of development various methods of capturing
and reflecting thermal pressure were implemented.

The first method evaluated was to convert the
capped max frequency into capacity and have the scheduler use this
instantaneous value when updating cpu_capacity.
This method is referenced as "Instantaneous Thermal Pressure" in the
test results below.
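The frequency-to-capacity conversion behind this first method can be
sketched as a simple linear scaling. This is an illustration, not the
series' code; the function name capped_capacity is invented here, and
SCHED_CAPACITY_SCALE mirrors the kernel's usual capacity unit of 1024:

```c
#include <stdint.h>

#define SCHED_CAPACITY_SCALE	1024UL

/*
 * Illustrative sketch: capacity scales linearly with the allowed
 * frequency, so a cpu capped at half its max frequency reports
 * half its max capacity.
 */
static unsigned long capped_capacity(unsigned long max_capacity,
				     unsigned long capped_freq,
				     unsigned long max_freq)
{
	return max_capacity * capped_freq / max_freq;
}
```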

The next two methods employ different ways of averaging the
thermal pressure before applying it when updating cpu_capacity.
The first of these re-uses the PELT algorithm already present in
the kernel, which averages rt and dl load and utilization.
This method is referenced as "Thermal Pressure Averaging using PELT fmwk"
in the test results below.

The final method employs an averaging algorithm that collects and
decays thermal pressure based on the decay period. In this method,
the decay period is configurable. This method is referenced as
"Thermal Pressure Averaging non-PELT Algo. Decay : XXX ms" in the
test results below.
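The effect of the configurable decay period can be illustrated with a
simple geometric decay step: a shorter period means the recorded
pressure is halved more often, so past thermal events are forgotten
faster. The function below is a hypothetical sketch, assuming a
halving per elapsed decay period, and is not the algorithm in the
series:

```c
#include <stdint.h>

/*
 * Illustrative only: halve the averaged pressure once for each
 * decay period that has elapsed. Guard against shifting by the
 * full width of the type, which is undefined behaviour in C.
 */
static uint64_t decay_pressure(uint64_t pressure, uint64_t periods)
{
	return periods >= 64 ? 0 : pressure >> periods;
}
```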

The test results below show a 3-5% improvement in performance when
using the third solution compared to the default system today, where
the scheduler is unaware of cpu capacity limitations due to thermal events.


Hackbench: (1 group , 30000 loops, 10 runs)
Result Standard Deviation
(Time Secs) (% of mean)

No Thermal Pressure 10.21 7.99%

Instantaneous thermal pressure 10.16 5.36%

Thermal Pressure Averaging
using PELT fmwk 9.88 3.94%

Thermal Pressure Averaging
non-PELT Algo. Decay : 500 ms 9.94 4.59%

Thermal Pressure Averaging
non-PELT Algo. Decay : 250 ms 7.52 5.42%

Thermal Pressure Averaging
non-PELT Algo. Decay : 125 ms 9.87 3.94%



Aobench: Size 2048 * 2048
Result Standard Deviation
(Time Secs) (% of mean)

No Thermal Pressure 141.58 15.85%

Instantaneous thermal pressure 141.63 15.03%

Thermal Pressure Averaging
using PELT fmwk 134.48 13.16%

Thermal Pressure Averaging
non-PELT Algo. Decay : 500 ms 133.62 13.00%

Thermal Pressure Averaging
non-PELT Algo. Decay : 250 ms 137.22 15.30%

Thermal Pressure Averaging
non-PELT Algo. Decay : 125 ms 137.55 13.26%

Dhrystone was run 10 times, with each run spawning 20 threads of
500 MLOOPS. The idea here is to measure the total dhrystone run
time and not look at individual processor performance.

Dhrystone Run Time
Result Standard Deviation
(Time Secs) (% of mean)

No Thermal Pressure 1.14 10.04%

Instantaneous thermal pressure 1.15 9%

Thermal Pressure Averaging
using PELT fmwk 1.19 11.60%

Thermal Pressure Averaging
non-PELT Algo. Decay : 500 ms 1.09 7.51%

Thermal Pressure Averaging
non-PELT Algo. Decay : 250 ms 1.012 7.02%

Thermal Pressure Averaging
non-PELT Algo. Decay : 125 ms 1.12 9.02%

V1->V2: Removed the use of the PELT framework for thermal pressure
accumulation and averaging. Instead, implemented a weighted average algorithm.

Thara Gopinath (3):
Calculate Thermal Pressure
sched/fair: update cpu_capacity to reflect thermal pressure
thermal/cpu-cooling: Update thermal pressure in case of a maximum
frequency capping

drivers/thermal/cpu_cooling.c | 4 +
include/linux/sched/thermal.h | 11 +++
kernel/sched/Makefile | 2 +-
kernel/sched/fair.c | 4 +
kernel/sched/thermal.c | 220 ++++++++++++++++++++++++++++++++++++++++++
5 files changed, 240 insertions(+), 1 deletion(-)
create mode 100644 include/linux/sched/thermal.h
create mode 100644 kernel/sched/thermal.c
--
2.1.4