Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime

From: Dietmar Eggemann
Date: Tue Feb 07 2023 - 05:30:00 EST


On 09/11/2022 16:49, Peter Zijlstra wrote:
> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
>> On 11/07/22 14:41, Peter Zijlstra wrote:
>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:

[...]

> So one thing that was key to that hack I proposed is that it is
> per-task. This means we can either set or detect the task activation
> period and use that to select an appropriate PELT multiplier.
>
> But please explain; once tasks are in a steady state (60HZ, 90HZ or god
> forbit higher), the utilization should be the same between the various
> PELT window sizes, provided the activation period isn't *much* larger
> than the window.
>
> Are these things running a ton of single shot tasks or something daft
> like that?

This investigation tries to answer these questions. The results can
be found in sections (B) and (C).

I ran 'util_est_faster' with delta equal to 'duration of the current
activation'. I.e. the following patch is needed:

https://lkml.kernel.org/r/ec049fd9b635f76a9e1d1ad380fd9184ebeeca53.1671158588.git.yu.c.chen@xxxxxxxxx

The testcase is Jankbench on Android 12 on Pixel6, CPU orig capacity
= [124 124 124 124 427 427 1024 1024], w/ mainline v5.18 kernel and
forward ported task scheduler patches.

(A) *** 'util_est_faster' vs. 'scaled util_est_faster' ***

The initial approach didn't scale the runtime duration. It is based
on task clock and not PELT clock but it should be scaled by uArch
and frequency to align with the PELT time used for util tracking.

However, the original (unscaled) approach shows better results than the
scaled one: its even more aggressive boosting on non-big CPUs helps to
raise the frequency even quicker in the scenario described under (B).
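
To illustrate what 'scaled by uArch and frequency' means here, a minimal
userspace sketch, assuming the usual fixed-point convention in which
capacities are expressed relative to 1024. The helper names are made up
for this illustration; the kernel-side equivalent is the cap_scale() use
in the patch at the end of this mail.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SHIFT	10
#define CAPACITY_SCALE	(1ULL << CAPACITY_SHIFT)	/* 1024: biggest CPU at max frequency */

/* Scale a value by a capacity in the range [0..1024] (illustrative helper). */
static uint64_t scale_by_capacity(uint64_t v, uint64_t cap)
{
	assert(cap <= CAPACITY_SCALE);
	return (v * cap) >> CAPACITY_SHIFT;
}

/*
 * Convert a runtime measured on the task clock into the amount of work
 * it represents at 'max uArch'/'max CPU frequency'.
 */
static uint64_t scale_runtime(uint64_t runtime_us, uint64_t cpu_cap,
			      uint64_t freq_cap)
{
	runtime_us = scale_by_capacity(runtime_us, cpu_cap);
	return scale_by_capacity(runtime_us, freq_cap);
}
```

With the CPU capacities quoted above, 10ms of task clock on a little CPU
(124) running at half of its maximum frequency counts as roughly 0.6ms of
scaled runtime, whereas on a big CPU (1024) at maximum frequency it counts
in full. Without the scaling the full 10ms is used on every CPU, which is
why the unscaled variant boosts non-big CPUs harder.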

All tests ran 10 iterations of all Jankbench sub-tests.

Max_frame_duration:
+------------------------+------------+
| kernel | value |
+------------------------+------------+
| base-a30b17f016b0 | 147.571352 |
| util_est_faster | 84.834999 |
| scaled_util_est_faster | 127.72855 |
+------------------------+------------+

Mean_frame_duration:
+------------------------+-------+-----------+
| kernel | value | perc_diff |
+------------------------+-------+-----------+
| base-a30b17f016b0 | 14.7 | 0.0% |
| util_est_faster | 12.6 | -14.01% |
| scaled_util_est_faster | 13.5 | -8.45% |
+------------------------+-------+-----------+

Jank percentage (Jank deadline 16ms):
+------------------------+-------+-----------+
| kernel | value | perc_diff |
+------------------------+-------+-----------+
| base-a30b17f016b0 | 1.8 | 0.0% |
| util_est_faster | 0.8 | -57.8% |
| scaled_util_est_faster | 1.4 | -25.89% |
+------------------------+-------+-----------+

Power usage [mW] (total - all CPUs):
+------------------------+-------+-----------+
| kernel | value | perc_diff |
+------------------------+-------+-----------+
| base-a30b17f016b0 | 144.4 | 0.0% |
| util_est_faster | 150.9 | 4.45% |
| scaled_util_est_faster | 152.2 | 5.4% |
+------------------------+-------+-----------+

'scaled util_est_faster' is used as the base for all following tests.

(B) *** Where does util_est_faster help exactly? ***

It turns out that the score improvement comes from the more aggressive
DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util()
-> effective_cpu_util(..., cpu_util_cfs(), ...).

At the beginning of an episode (e.g. beginning of an image list view
fling), when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
frequency') of the Android Graphics Pipeline (AGP) start to run, the
CPU Operating Performance Point (OPP) is often so low that those tasks
run more like 10/16ms, which makes the test application count a lot of
Jankframes at those moments.

This is where the util_est_faster approach helps, by boosting CPU
util according to the 'runtime of the current activation'.
Moreover, it could also be that the tasks simply have more work to do in
these first activations at the beginning of an episode.
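
As a rough illustration of how a ~1/16ms task can turn into a ~10/16ms
one, here is a small sketch assuming that runtime grows inversely with
the capacity the CPU currently delivers (a simplification; memory-bound
work does not scale this linearly). The function name and the capacity
value used below are made up for the example, not measured.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SCALE	1024ULL	/* 'max uArch' at 'max CPU frequency' */

/*
 * Wall-clock runtime of a job that needs work_us at full capacity when
 * the CPU currently only delivers cur_cap (uArch and frequency combined).
 */
static uint64_t stretched_runtime_us(uint64_t work_us, uint64_t cur_cap)
{
	assert(cur_cap > 0 && cur_cap <= CAPACITY_SCALE);
	return work_us * CAPACITY_SCALE / cur_cap;
}
```

Under this assumption, 1ms of work at full capacity takes about 10ms when
the CPU delivers only about a tenth of the maximum (e.g.
stretched_runtime_us(1000, 102)), which is close to the 16ms Jank
deadline once any other delay in the frame is added.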

All the other places in which cpu_util_cfs() is used:

(2) CFS load balance ('_lb')
(3) CPU overutilization ('_ou')
(4) CFS fork/exec task placement ('_slowpath')

when tested individually, show no improvement or even a regression.

Max_frame_duration:
+---------------------------------+------------+
| kernel | value |
+---------------------------------+------------+
| scaled_util_est_faster | 127.72855 |
| scaled_util_est_faster_freq | 126.646506 |
| scaled_util_est_faster_lb | 162.596249 |
| scaled_util_est_faster_ou | 166.59519 |
| scaled_util_est_faster_slowpath | 153.966638 |
+---------------------------------+------------+

Mean_frame_duration:
+---------------------------------+-------+-----------+
| kernel | value | perc_diff |
+---------------------------------+-------+-----------+
| scaled_util_est_faster | 13.5 | 0.0% |
| scaled_util_est_faster_freq | 13.7 | 1.79% |
| scaled_util_est_faster_lb | 14.8 | 9.87% |
| scaled_util_est_faster_ou | 14.5 | 7.46% |
| scaled_util_est_faster_slowpath | 16.2 | 20.45% |
+---------------------------------+-------+-----------+

Jank percentage (Jank deadline 16ms):
+---------------------------------+-------+-----------+
| kernel | value | perc_diff |
+---------------------------------+-------+-----------+
| scaled_util_est_faster | 1.4 | 0.0% |
| scaled_util_est_faster_freq | 1.3 | -2.34% |
| scaled_util_est_faster_lb | 1.7 | 27.42% |
| scaled_util_est_faster_ou | 2.1 | 50.33% |
| scaled_util_est_faster_slowpath | 2.8 | 102.39% |
+---------------------------------+-------+-----------+

Power usage [mW] (total - all CPUs):
+---------------------------------+-------+-----------+
| kernel | value | perc_diff |
+---------------------------------+-------+-----------+
| scaled_util_est_faster | 152.2 | 0.0% |
| scaled_util_est_faster_freq | 132.3 | -13.1% |
| scaled_util_est_faster_lb | 137.1 | -9.96% |
| scaled_util_est_faster_ou | 132.4 | -13.04% |
| scaled_util_est_faster_slowpath | 141.3 | -7.18% |
+---------------------------------+-------+-----------+

(C) *** Which tasks contribute the most to the score improvement? ***

A trace_event capturing the cases in which a task's util_est_fast trumps
CPU util was added to cpu_util_cfs(). This is 1 iteration of Jankbench
and the base is (1) 'scaled_util_est_faster_freq':

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_6.ipynb

'Cell [6]' shows the tasks of the Jankbench process
'[com.an]droid.benchmark' which are boosting the CPU frequency request.

Among them are 2 main threads of the AGP, '[com.an]droid.benchmark' and
'RenderThread'.
The spikes in util_est_fast are congruent with the aforementioned
beginning of an episode in which these periodic tasks are running and
when their runtime/period is rather ~10/16ms and not ~1-2/16ms since
the CPU OPP is still low.

Very few other Jankbench tasks ('Cell [6]') show the same behaviour. The
Surfaceflinger process ('Cell [8]') is not affected and, of the kernel
tasks, only kcompactd0 creates a mild boost ('Cell [9]').

As expected, running a non-scaled version of (1) shows more aggressive
boosting on non-big CPUs:

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_5.ipynb

It looks like 'util_est_faster' can prevent Jankframes by boosting CPU
util when periodic tasks have a longer runtime than they have once they
reach steady state.

The result is very similar to PELT halflife reduction. The advantage is
that 'util_est_faster' is only activated selectively when the runtime of
the current task in its current activation is long enough to create this
CPU util boost.

Original patch:
https://lkml.kernel.org/r/Y2kLA8x40IiBEPYg@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Changes applied:
- use 'duration of the current activation' as delta
- delta >>= 10
- uArch and frequency scaling of delta
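
To give a feeling for the numbers, here is a userspace approximation of
what faster_est_approx() in the patch below computes. It is only a
sketch: it uses floating point instead of the kernel's integer math and
lookup table, and it assumes the mainline PELT constants (1024us
periods, y^32 = 0.5, LOAD_AVG_MAX = 47742). The names pelt_decay() and
faster_est_sketch() are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define LOAD_AVG_PERIOD	32		/* y^32 == 0.5 */
#define LOAD_AVG_MAX	47742.0		/* maximum possible PELT sum */
#define PELT_PERIOD_US	1024
#define PELT_Y		0.978572062	/* approx. 2^(-1/32) */

/* y^periods without libm: halve per 32 periods, multiply for the rest. */
static double pelt_decay(uint64_t periods)
{
	double d = 1.0;
	uint64_t i;

	if (periods >= 64 * LOAD_AVG_PERIOD)	/* treated as fully decayed */
		return 0.0;
	for (i = 0; i < periods / LOAD_AVG_PERIOD; i++)
		d *= 0.5;
	for (i = 0; i < periods % LOAD_AVG_PERIOD; i++)
		d *= PELT_Y;
	return d;
}

/* Approximate util of a task with no history after running for delta_us. */
static unsigned long faster_est_sketch(uint64_t delta_us)
{
	uint64_t periods = delta_us / PELT_PERIOD_US;
	/* Contribution of the current, partial period. */
	double contrib = (double)(delta_us % PELT_PERIOD_US);

	if (periods) {
		double d = pelt_decay(periods);

		/* Oldest (decayed) period plus the full periods in between. */
		contrib += 1024.0 * d + (LOAD_AVG_MAX * (1.0 - d) - 1024.0);
	}

	/* Same divider as PELT_MIN_DIVIDER: LOAD_AVG_MAX - 1024. */
	return (unsigned long)(contrib * 1024.0 / (LOAD_AVG_MAX - 1024.0));
}
```

Under these assumptions, 32 periods (~32ms) of input give a util of about
512, i.e. half of the maximum. Since the patch passes 'delta * 2', that
point would already be reached after ~16ms of scaled runtime in the
current activation.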

-->%--

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index efdc29c42161..76d146d06bbe 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -97,6 +97,7 @@ SCHED_FEAT(WA_BIAS, true)
  */
 SCHED_FEAT(UTIL_EST, true)
 SCHED_FEAT(UTIL_EST_FASTUP, true)
+SCHED_FEAT(UTIL_EST_FASTER, true)
 
 SCHED_FEAT(LATENCY_WARN, false)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 0f310768260c..13cd9e27ce3e 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -148,6 +148,22 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	return periods;
 }
 
+/*
+ * Compute a pelt util_avg assuming no history and @delta runtime.
+ */
+unsigned long faster_est_approx(u64 delta)
+{
+	unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */
+	u64 periods = delta / 1024;
+
+	if (periods) {
+		delta %= 1024;
+		contrib = __accumulate_pelt_segments(periods, 1024, delta);
+	}
+
+	return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER;
+}
+
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series. To do this we sub-divide our runnable
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1072502976df..7cb45f1d8062 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2961,6 +2961,8 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
 	return READ_ONCE(rq->avg_dl.util_avg);
 }
 
+extern unsigned long faster_est_approx(u64 runtime);
+
 /**
  * cpu_util_cfs() - Estimates the amount of CPU capacity used by CFS tasks.
  * @cpu: the CPU to get the utilization for.
@@ -2995,13 +2997,39 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
  */
 static inline unsigned long cpu_util_cfs(int cpu)
 {
+	struct rq *rq = cpu_rq(cpu);
 	struct cfs_rq *cfs_rq;
 	unsigned long util;
 
-	cfs_rq = &cpu_rq(cpu)->cfs;
+	cfs_rq = &rq->cfs;
 	util = READ_ONCE(cfs_rq->avg.util_avg);
 
 	if (sched_feat(UTIL_EST)) {
+		if (sched_feat(UTIL_EST_FASTER)) {
+			struct task_struct *curr;
+
+			rcu_read_lock();
+			curr = rcu_dereference(rq->curr);
+			if (likely(curr->sched_class == &fair_sched_class)) {
+				unsigned long util_est_fast;
+				u64 delta;
+
+				delta = curr->se.sum_exec_runtime -
+					curr->se.prev_sum_exec_runtime_vol;
+
+				delta >>= 10;
+				if (!delta)
+					goto unlock;
+
+				delta = cap_scale(delta, arch_scale_cpu_capacity(cpu));
+				delta = cap_scale(delta, arch_scale_freq_capacity(cpu));
+
+				util_est_fast = faster_est_approx(delta * 2);
+				util = max(util, util_est_fast);
+			}
+unlock:
+			rcu_read_unlock();
+		}
 		util = max_t(unsigned long, util,
 			     READ_ONCE(cfs_rq->avg.util_est.enqueued));
 	}