Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

From: Morten Rasmussen
Date: Mon Oct 09 2017 - 04:09:14 EST


On Fri, Sep 01, 2017 at 03:21:11PM +0200, Peter Zijlstra wrote:
> When an entity migrates in (or out) of a runqueue, we need to add (or
> remove) its contribution from the entire PELT hierarchy, because even
> non-runnable entities are included in the load average sums.
>
> In order to do this we have some propagation logic that updates the
> PELT tree, however the way it 'propagates' the runnable (or load)
> change is (more or less):
>
>                    tg->weight * grq->avg.load_avg
> ge->avg.load_avg = ------------------------------
>                            tg->load_avg
>
> But that is the expression for ge->weight, and per the definition of
> load_avg:
>
> ge->avg.load_avg := ge->weight * ge->avg.runnable_avg

You may want to replace "ge->weight" with "ge->load.weight" to be
consistent with the code comments introduced by the patch.
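
To make the "destroys" claim below concrete: a toy calculation
(self-contained C with made-up numbers and a simplified fixed-point
scale, not the kernel's types) shows that assigning the weight
expression to load_avg leaves an implied runnable_avg of exactly 1,
whatever the inputs:

    #include <stdio.h>

    #define FP 1024                     /* fixed point: 1.0 == 1024 */

    int main(void)
    {
            /* made-up numbers, purely for illustration */
            long tg_weight = 2048, tg_load_avg = 1024, grq_load_avg = 512;

            /* the expression the old propagation assigns to
             * ge->avg.load_avg; per the definition it is really the
             * expression for ge->load.weight */
            long ge_weight   = tg_weight * grq_load_avg / tg_load_avg;
            long ge_load_avg = ge_weight;

            /* load_avg := weight * runnable_avg, so the implied
             * runnable_avg is load_avg / weight == 1.0, always */
            long runnable_avg = ge_load_avg * FP / ge_weight;

            printf("implied runnable_avg = %ld (i.e. 1.0)\n", runnable_avg);
            return 0;
    }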

> That destroys the runnable_avg (by setting it to 1) we wanted to
> propagate.
>
> Instead directly propagate runnable_sum.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> ---
> kernel/sched/debug.c |    2
> kernel/sched/fair.c  |  186 ++++++++++++++++++++++++++++-----------------------
> kernel/sched/sched.h |    9 +-
> 3 files changed, 112 insertions(+), 85 deletions(-)
>
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
>                          cfs_rq->removed.load_avg);
>          SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
>                          cfs_rq->removed.util_avg);
> +        SEQ_printf(m, " .%-30s: %ld\n", "removed.runnable_sum",
> +                        cfs_rq->removed.runnable_sum);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>          SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib",
>                          cfs_rq->tg_load_avg_contrib);
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_enti
>          se->avg.last_update_time = n_last_update_time;
>  }
>
> -/* Take into account change of utilization of a child task group */
> +
> +/*
> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> + * propagate its contribution. The key to this propagation is the invariant
> + * that for each group:
> + *
> + * ge->avg == grq->avg (1)
> + *
> + * _IFF_ we look at the pure running and runnable sums. Because they
> + * represent the very same entity, just at different points in the hierarchy.
> + *
> + *
> + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> + * simply copies the running sum over.
> + *
> + * However, update_tg_cfs_runnable() is more complex. So we have:
> + *
> + * ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg (2)
> + *
> + * And since, like util, the runnable part should be directly transferable,
> + * the following would _appear_ to be the straight forward approach:
> + *
> + * grq->avg.load_avg = grq->load.weight * grq->avg.running_avg (3)

Should it be grq->avg.runnable_avg instead of running_avg?

cfs_rq->avg.load_avg has been defined previously (in patch 2, I think)
to be:

load_avg = \Sum se->avg.load_avg
         = \Sum se->load.weight * se->avg.runnable_avg

That sum will increase when an se is runnable, regardless of whether it
is running or not. So I think it has to be runnable_avg to make sense?
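
To illustrate (plain C with made-up numbers, not kernel code; the
fixed-point scale only mimics PELT's): two always-runnable tasks
time-sharing one CPU each have runnable_avg ~1.0 but running_avg ~0.5,
and only the runnable-based sum reproduces the full w1 + w2:

    #include <stdio.h>

    #define FP 1024                     /* fixed point: 1.0 == 1024 */

    int main(void)
    {
            long weight       = 1024;   /* both tasks at the same weight */
            long runnable_avg = FP;     /* runnable 100% of the time */
            long running_avg  = FP / 2; /* but running only ~50% of it */

            /* grq->avg.load_avg per the definition quoted above */
            long by_runnable = 2 * weight * runnable_avg / FP;  /* 2048 */
            /* what (3) would claim if it used running_avg instead */
            long by_running  = 2 * weight * running_avg / FP;   /* 1024 */

            printf("runnable-based: %ld, running-based: %ld\n",
                   by_runnable, by_running);
            return 0;
    }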

> + *
> + * And per (1) we have:
> + *
> + * ge->avg.running_avg == grq->avg.running_avg

You just said further up that (1) only applies to the pure running and
runnable sums? These are averages, so I think this is an invalid use of
(1). But maybe that is part of your point about (4) being wrong?

I'm still trying to get my head around the remaining bits, but it sort
of depends on whether I understood the above bits correctly :)

Morten

> + *
> + * Which gives:
> + *
> + *                    ge->load.weight * grq->avg.load_avg
> + * ge->avg.load_avg = -----------------------------------   (4)
> + *                             grq->load.weight
> + *
> + * Except that is wrong!
> + *
> + * Because while for entities historical weight is not important and we
> + * really only care about our future and therefore can consider a pure
> + * runnable sum, runqueues can NOT do this.
> + *
> + * We specifically want runqueues to have a load_avg that includes
> + * historical weights. Those represent the blocked load, the load we expect
> + * to (shortly) return to us. This only works by keeping the weights as
> + * integral part of the sum. We therefore cannot decompose as per (3).
> + *
> + * OK, so what then?
> + *
> + *
> + * Another way to look at things is:
> + *
> + * grq->avg.load_avg = \Sum se->avg.load_avg
> + *
> + * Therefore, per (2):
> + *
> + * grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> + *
> + * And the very thing we're propagating is a change in that sum (someone
> + * joined/left). So we can easily know the runnable change, which would be, per
> + * (2) the already tracked se->load_avg divided by the corresponding
> + * se->weight.
> + *
> + * Basically (4) but in differential form:
> + *
> + * d(runnable_avg) += se->avg.load_avg / se->load.weight
> + *                                                         (5)
> + * ge->avg.load_avg += ge->load.weight * d(runnable_avg)
> + */
> +
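
For reference, here is how I read (5) as a minimal self-contained
sketch (the toy structs, made-up numbers, and the helper name are mine,
not the patch's; the real code propagates runnable_sum and maintains
the matching averages on top of this):

    #include <stdio.h>

    #define FP 1024                     /* fixed point: 1.0 == 1024 */

    struct load_weight { long weight; };
    struct sched_avg   { long load_avg; };

    struct entity {
            struct load_weight load;
            struct sched_avg   avg;
    };

    /*
     * Equation (5): derive the runnable change from the migrating
     * entity 'se' and scale it by the group entity's own weight.
     * 'joining' is +1 for migrate-in, -1 for migrate-out.
     */
    static void propagate_load(struct entity *ge, const struct entity *se,
                               int joining)
    {
            /* d(runnable_avg) = se->avg.load_avg / se->load.weight */
            long d_runnable = se->avg.load_avg * FP / se->load.weight;

            /* ge->avg.load_avg += ge->load.weight * d(runnable_avg) */
            ge->avg.load_avg += joining * ge->load.weight * d_runnable / FP;
    }

    int main(void)
    {
            struct entity ge = { .load = { 2048 }, .avg = { 1024 } };
            struct entity se = { .load = { 1024 }, .avg = {  512 } };

            propagate_load(&ge, &se, +1);       /* se migrates in */
            /* 1024 + 2048 * 0.5 = 2048 */
            printf("ge->avg.load_avg = %ld\n", ge.avg.load_avg);
            return 0;
    }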