Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

From: Vincent Guittot
Date: Tue Oct 31 2017 - 07:14:38 EST


On 30 October 2017 at 18:20, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> So after a bit of poking I ended up with something like the below; I
> think there's still a few open points, see XXX. But its better than we
> have now.
>
> Josef, could you see if this completely wrecks your workloads?
>
> ---
> Subject: sched: Update runnable propagation rule
> From: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Date: Thu, 19 Oct 2017 17:04:42 +0200
>
> Unlike running, the runnable part can't be directly propagated through
> the hierarchy when we migrate a task. The main reason is that runnable
> time can be shared with other sched_entities that stay on the rq and
> this runnable time will also remain on prev cfs_rq and must not be
> removed.
>
> Instead, we can estimate what should be the new runnable of the prev
> cfs_rq and check that this estimation stay in a possible range. The
> prop_runnable_sum is a good estimation when adding runnable_sum but
> fails most often when we remove it. Instead, we could use the formula
> below instead:
>
> gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight
>
> which assumes that tasks are equally runnable which is not true but
> easy to compute.
>
> Beside these estimates, we have several simple rules that help us to filter
> out wrong ones:
>
> - ge->avg.runnable_sum <= than LOAD_AVG_MAX
> - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
> - ge->avg.runnable_sum can't increase when we detach a task
>
> Cc: Yuyang Du <yuyang.du@xxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Mike Galbraith <efault@xxxxxx>
> Cc: Chris Mason <clm@xxxxxx>
> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> Cc: Josef Bacik <josef@xxxxxxxxxxxxxx>
> Cc: Ben Segall <bsegall@xxxxxxxxxx>
> Cc: Paul Turner <pjt@xxxxxxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Link: http://lkml.kernel.org/r/20171019150442.GA25025@xxxxxxxxxx
> ---
>
> kernel/sched/fair.c | 99 ++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 70 insertions(+), 29 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3412,9 +3412,9 @@ void set_task_rq_fair(struct sched_entit
> * _IFF_ we look at the pure running and runnable sums. Because they
> * represent the very same entity, just at different points in the hierarchy.
> *
> - *
> - * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> - * simply copies the running sum over.
> + * Per the above update_tg_cfs_util() is trivial and simply copies the running
> + * sum over (but still wrong, because the group entity and group rq do not have
> + * their PELT windows aligned).
> *
> * However, update_tg_cfs_runnable() is more complex. So we have:
> *
> @@ -3423,11 +3423,11 @@ void set_task_rq_fair(struct sched_entit
> * And since, like util, the runnable part should be directly transferable,
> * the following would _appear_ to be the straight forward approach:
> *
> - * grq->avg.load_avg = grq->load.weight * grq->avg.running_avg (3)
> + * grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg (3)
> *
> * And per (1) we have:
> *
> - * ge->avg.running_avg == grq->avg.running_avg
> + * ge->avg.runnable_avg == grq->avg.runnable_avg
> *
> * Which gives:
> *
> @@ -3446,27 +3446,28 @@ void set_task_rq_fair(struct sched_entit
> * to (shortly) return to us. This only works by keeping the weights as
> * integral part of the sum. We therefore cannot decompose as per (3).
> *
> - * OK, so what then?
> + * Another reason this doesn't work is that runnable isn't a 0-sum entity.
> + * Imagine a rq with 2 tasks that each are runnable 2/3 of the time. Then the
> + * rq itself is runnable anywhere between 2/3 and 1 depending on how the
> + * runnable section of these tasks overlap (or not). If they were to perfectly
> + * align the rq as a whole would be runnable 2/3 of the time. If however we
> + * always have at least 1 runnable task, the rq as a whole is always runnable.
> *
> + * So we'll have to approximate.. :/
> *
> - * Another way to look at things is:
> + * Given the constraint:
> *
> - * grq->avg.load_avg = \Sum se->avg.load_avg
> + * ge->avg.running_sum <= ge->avg.runnable_sum <= LOAD_AVG_MAX
> *
> - * Therefore, per (2):
> + * We can construct a rule that adds runnable to a rq by assuming minimal
> + * overlap.
> *
> - * grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> + * On removal, we'll assume each task is equally runnable; which yields:
> *
> - * And the very thing we're propagating is a change in that sum (someone
> - * joined/left). So we can easily know the runnable change, which would be, per
> - * (2) the already tracked se->load_avg divided by the corresponding
> - * se->weight.
> + * grq->avg.runnable_sum = grq->avg.load_sum / grq->load.weight
> *
> - * Basically (4) but in differential form:
> + * XXX: only do this for the part of runnable > running ?

That can be a possible improvement in how we are estimating the the runnable.
I'm going to make some trials to see the impact on the estimation

> *
> - * d(runnable_avg) += se->avg.load_avg / se->load.weight
> - * (5)
> - * ge->avg.load_avg += ge->load.weight * d(runnable_avg)
> */
>
> static inline void
> @@ -3478,6 +3479,14 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
> if (!delta)
> return;
>
> + /*
> + * The relation between sum and avg is:
> + *
> + * LOAD_AVG_MAX - 1024 + sa->period_contrib
> + *
> + * however, the PELT windows are not aligned between grq and gse.
> + */
> +
> /* Set new sched_entity's utilization */
> se->avg.util_avg = gcfs_rq->avg.util_avg;
> se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
> @@ -3490,33 +3499,65 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
> static inline void
> update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
> {
> - long runnable_sum = gcfs_rq->prop_runnable_sum;
> - long runnable_load_avg, load_avg;
> - s64 runnable_load_sum, load_sum;
> + long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
> + unsigned long runnable_load_avg, load_avg;
> + u64 runnable_load_sum, load_sum = 0;
> + s64 delta_sum;
>
> if (!runnable_sum)
> return;
>
> gcfs_rq->prop_runnable_sum = 0;
>
> + if (runnable_sum >= 0) {
> + /*
> + * Add runnable; clip at LOAD_AVG_MAX. Reflects that until
> + * the CPU is saturated running == runnable.
> + */
> + runnable_sum += se->avg.load_sum;
> + runnable_sum = min(runnable_sum, (long)LOAD_AVG_MAX);
> + } else {
> + /*
> + * Estimate the departing task's runnable by assuming all tasks
> + * are equally runnable.
> + *
> + * XXX: doesn't deal with multiple departures?

Why this would not deal with multiple departures ?
we are using gcfs_rq->avg.load_sum that reflects the new state of the
gcfs_rq to evaluate the runnable_sum

> + */
> + if (scale_load_down(gcfs_rq->load.weight)) {
> + load_sum = div_s64(gcfs_rq->avg.load_sum,
> + scale_load_down(gcfs_rq->load.weight));
> + }
> +
> + /* But make sure to not inflate se's runnable */
> + runnable_sum = min(se->avg.load_sum, load_sum);
> + }
> +
> + /* runnable_sum can't be lower than running_sum */
> + running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT; /* XXX ? */

running_sum is scaled by cpu's capacity but not load_sum

I have made the shortcut of using SCHED_CAPACITY_SHIFT for capacity
but we might better use arch_scale_cpu_capacity(NULL, cpu) instead

> + runnable_sum = max(runnable_sum, running_sum);
> +
> load_sum = (s64)se_weight(se) * runnable_sum;
> load_avg = div_s64(load_sum, LOAD_AVG_MAX);
>
> - add_positive(&se->avg.load_sum, runnable_sum);
> - add_positive(&se->avg.load_avg, load_avg);
> + delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
> + delta_avg = load_avg - se->avg.load_avg;
>
> - add_positive(&cfs_rq->avg.load_avg, load_avg);
> - add_positive(&cfs_rq->avg.load_sum, load_sum);
> + se->avg.load_sum = runnable_sum;
> + se->avg.load_avg = load_avg;
> + add_positive(&cfs_rq->avg.load_avg, delta_avg);
> + add_positive(&cfs_rq->avg.load_sum, delta_sum);
>
> runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
> runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
> + delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
> + delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
>
> - add_positive(&se->avg.runnable_load_sum, runnable_sum);
> - add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
> + se->avg.runnable_load_sum = runnable_sum;
> + se->avg.runnable_load_avg = runnable_load_avg;
>
> if (se->on_rq) {
> - add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
> - add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
> + add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
> + add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
> }
> }