Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

From: Vincent Guittot
Date: Fri Oct 13 2017 - 11:23:10 EST


Hi Peter,

On Tuesday 10 Oct 2017 at 09:44:53 (+0200), Vincent Guittot wrote:
> On 10 October 2017 at 09:29, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > On Mon, Oct 09, 2017 at 05:29:04PM +0200, Vincent Guittot wrote:
> >> On 9 October 2017 at 17:03, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
> >> > On 1 September 2017 at 15:21, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> >> >> +/*
> >> >> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> >> >> + * propagate its contribution. The key to this propagation is the invariant
> >> >> + * that for each group:
> >> >> + *
> >> >> + * ge->avg == grq->avg (1)
> >> >> + *
> >> >> + * _IFF_ we look at the pure running and runnable sums. Because they
> >> >> + * represent the very same entity, just at different points in the hierarchy.
> >> >
> >> > I agree for the running part because only one entity can be running,
> >> > but I'm not sure about the pure runnable sum because we can have
> >> > several runnable tasks in a cfs_rq but only one runnable group entity
> >> > to reflect them, unless I misunderstand (1)
> >
> > The idea is that they (ge and grq) are the _same_ entity, just at
> > different levels in the hierarchy. If the grq is runnable, it is through
> > the ge.
> >
> > As a whole, they don't care how many runnable tasks there are.
>
> OK, so I agree with this point that both grq's and ge's runnable follow
> the same trend.
>
> My point is that even if the same changes apply to both grq and ge, we
> can't directly apply changes in grq's runnable to ge's runnable.
>
> >
> >> > As an example, we have 2 always-running tasks TA and TB, so their
> >> > load_sum is LOAD_AVG_MAX for each task. The grq->avg.load_sum = \Sum
> >> > se->avg.load_sum = 2*LOAD_AVG_MAX, but the ge->avg.load_sum will be
> >> > only LOAD_AVG_MAX.
> >> >
> >> > So if we directly apply the d(TB->avg.load_sum) on the group hierarchy,
> >> > and on ge->avg.load_sum in particular, the latter decreases to 0
> >> > whereas it should only decrease by half.
> >> >
> >> > I have been able to see this wrong behavior with an rt-app json file
> >> >
> >> > so I think that we should instead remove only
> >> >
> >> > delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum
> >>
> >> delta = se->avg.load_sum / (grq->avg.load_sum+se->avg.load_sum) *
> >> ge->avg.load_sum
> >>
> >> as the se has already been detached
> >>
> >> > We don't have grq->avg.load_sum but we can have a rough estimate with
> >> > grq->avg.load_avg/grq->weight
> >
> > Hurm, I think I see what you're saying, let me ponder this more.
>
> The formula above works as an example for detaching, but it doesn't
> apply to all use cases. We need a more generic way to propagate
> runnable changes.
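
To make the problem above concrete, here is a small standalone sketch of the
TA/TB example and of the ratio-based removal (plain userspace C, not kernel
code; LOAD_AVG_MAX matches the kernel constant, everything else is a made-up
illustration):

#include <stdio.h>

#define LOAD_AVG_MAX 47742	/* same constant as kernel/sched/sched-pelt.h */

int main(void)
{
	/* two always-running tasks TA and TB on the same group cfs_rq (grq) */
	long long ta_load_sum = LOAD_AVG_MAX;
	long long tb_load_sum = LOAD_AVG_MAX;

	/* grq sums the tasks, the group entity ge saturates at LOAD_AVG_MAX */
	long long grq_load_sum = ta_load_sum + tb_load_sum;
	long long ge_load_sum = LOAD_AVG_MAX;

	/* TB migrates away and has already been detached from grq */
	grq_load_sum -= tb_load_sum;

	/* naive propagation: apply TB's delta directly on ge */
	long long ge_naive = ge_load_sum - tb_load_sum;

	/* only remove TB's share of ge, as in the formula quoted above */
	long long delta = tb_load_sum * ge_load_sum /
				(grq_load_sum + tb_load_sum);
	long long ge_ratio = ge_load_sum - delta;

	printf("naive: %lld ratio-based: %lld\n", ge_naive, ge_ratio);
	return 0;
}

With the naive delta, ge's load_sum drops to 0 even though TA is still
running; with the ratio-based removal it drops to LOAD_AVG_MAX/2 as expected.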

I have studied a bit more how to improve the propagation formula, and the
changes below do the job for the use cases that I have tested.

Unlike running, we can't directly propagate the runnable through the hierarchy
when we migrate a task. Instead, we must ensure that we don't over/underestimate
the impact of the migration, by following several rules (a small standalone
sketch follows the list):
- ge->avg.runnable_sum can't be higher than LOAD_AVG_MAX
- ge->avg.runnable_sum can't be lower than ge->avg.running_sum (once scaled to
the same range)
- we can't directly propagate a negative delta of runnable_sum, because part of
this runnable time can be "shared" with other sched_entities and stays on the
gcfs_rq. Instead, we can estimate the new runnable_sum of the gcfs_rq with
the formula: gcfs_rq's runnable_sum = gcfs_rq's load_sum / gcfs_rq's weight
- ge->avg.runnable_sum can't increase when we detach a task.
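
Here is the small standalone sketch mentioned above: a rough userspace model of
how those rules clamp a runnable_sum delta before it is applied to ge (plain C,
not the kernel function in the patch below; names and the fixed-point details
are simplified assumptions):

#include <stdio.h>

#define LOAD_AVG_MAX 47742

/*
 * Simplified model of the clamping rules above (not the kernel code):
 * @delta:    runnable_sum change propagated from the gcfs_rq
 * @se_sum:   current load_sum of the group entity
 * @running:  group entity's running_sum, already scaled to the same range
 * @gcfs_est: gcfs_rq's load_sum / gcfs_rq's weight, the rough estimate
 */
static long clamp_runnable_sum(long delta, long se_sum, long running,
			       long gcfs_est)
{
	long runnable;

	if (delta >= 0) {
		/* attaching: add the delta but never exceed LOAD_AVG_MAX */
		runnable = se_sum + delta;
		if (runnable > LOAD_AVG_MAX)
			runnable = LOAD_AVG_MAX;
	} else {
		/*
		 * detaching: part of the runnable time may be shared with
		 * other entities, so use the gcfs_rq estimate and make sure
		 * ge's runnable_sum does not increase
		 */
		runnable = se_sum < gcfs_est ? se_sum : gcfs_est;
	}

	/* runnable_sum can never be lower than running_sum */
	if (runnable < running)
		runnable = running;

	return runnable;
}

int main(void)
{
	/* detach: keep the (lower) gcfs_rq estimate, bounded by running_sum */
	printf("%ld\n", clamp_runnable_sum(-20000, 40000, 15000, 25000));
	return 0;
}

The patch below applies the same rules directly in update_tg_cfs_runnable().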

---
kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 45 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350dbec0..a063b048 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3489,33 +3489,67 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
static inline void
update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long runnable_sum = gcfs_rq->prop_runnable_sum;
- long runnable_load_avg, load_avg;
- s64 runnable_load_sum, load_sum;
+ long running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+ long runnable_load_avg, delta_avg, load_avg;
+ s64 runnable_load_sum, delta_sum, load_sum = 0;

if (!runnable_sum)
return;

gcfs_rq->prop_runnable_sum = 0;

+ /*
+ * Get a rough estimate of gcfs_rq's runnable.
+ * This is a low guess, as it assumes that tasks are equally
+ * runnable, which is not true but we can't really do better.
+ */
+ if (scale_load_down(gcfs_rq->load.weight))
+ load_sum = div_s64(gcfs_rq->avg.load_sum,
+ scale_load_down(gcfs_rq->load.weight));
+
+ /*
+ * Propagating a delta of runnable is not just adding it to ge's
+ * runnable_sum:
+ * - Adding a runnable delta can't make ge's runnable_sum higher than
+ * LOAD_AVG_MAX
+ * - We can't directly remove a delta of runnable from
+ * ge's runnable_sum; we can only estimate what runnable
+ * will become, thanks to a few simple rules:
+ * - gcfs_rq's runnable is a good estimate
+ * - ge's runnable_sum can't increase when we remove runnable
+ * - runnable_sum can't be lower than running_sum
+ */
+ if (runnable_sum >= 0) {
+ runnable_sum += se->avg.load_sum;
+ runnable_sum = min(runnable_sum, LOAD_AVG_MAX);
+ } else
+ runnable_sum = min(se->avg.load_sum, load_sum);
+
+ running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
+ runnable_sum = max(runnable_sum, running_sum);
+
load_sum = (s64)se_weight(se) * runnable_sum;
load_avg = div_s64(load_sum, LOAD_AVG_MAX);

- add_positive(&se->avg.load_sum, runnable_sum);
- add_positive(&se->avg.load_avg, load_avg);
+ delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
+ delta_avg = load_avg - se->avg.load_avg;

- add_positive(&cfs_rq->avg.load_avg, load_avg);
- add_positive(&cfs_rq->avg.load_sum, load_sum);
+ se->avg.load_sum = runnable_sum;
+ se->avg.load_avg = load_avg;
+ add_positive(&cfs_rq->avg.load_avg, delta_avg);
+ add_positive(&cfs_rq->avg.load_sum, delta_sum);

runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+ delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
+ delta_avg = runnable_load_avg - se->avg.runnable_load_avg;

- add_positive(&se->avg.runnable_load_sum, runnable_sum);
- add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+ se->avg.runnable_load_sum = runnable_sum;
+ se->avg.runnable_load_avg = runnable_load_avg;

if (se->on_rq) {
- add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
- add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
+ add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
+ add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
}
}

--
2.7.4