Re: [PATCH RFC] sched/fair: let cpu's cfs_rq to reflect task migration

From: Morten Rasmussen
Date: Tue Apr 05 2016 - 05:11:00 EST


On Tue, Apr 05, 2016 at 02:56:44PM +0800, Leo Yan wrote:
> On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote:
> > On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> > > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> > > > I think I follow - Leo please correct me if I mangle your intentions.
> > > > It's an issue that Morten and Dietmar had mentioned to me as well.
> >
> > Yes. We have been working on this issue for a while without getting to a
> > nice solution yet.
>
> Good to know. This patch is mainly for discussion purposes.
>
> [...]
>
> > > > Leo I noticed you did not modify detach_entity_load_average(). I think
> > > > this would be needed to avoid the task's stats being double counted for
> > > > a while after switched_from_fair() or task_move_group_fair().
> >
> > I'm afraid that the solution to problem is more complicated than that
> > :-(
> >
> > You are adding/removing a contribution from the root cfs_rq.avg which
> > isn't part of the signal in the first place. The root cfs_rq.avg only
> > contains the sum of the load/util of the sched_entities on the cfs_rq.
> > If you remove the contribution of the tasks from there you may end up
> > double-accounting for the task migration: once due to your patch and then
> > again slowly over time as the group sched_entity starts reflecting that
> > the task has migrated. Furthermore, for group scheduling to make sense
> > it has to be the task_h_load() you add/remove otherwise the group
> > weighting is completely lost. Or am I completely misreading your patch?
>
> One thing I want to confirm first: though CFS maintains the task
> group hierarchy, it updates a task group's cfs_rq.avg and the root
> cfs_rq.avg independently rather than accounting them across the
> hierarchy.
>
> So currently CFS decreases the group's cfs_rq.avg for the task's
> migration, but it doesn't iterate up the task group hierarchy to the
> root cfs_rq.avg. I don't understand the second accounting you
> mentioned: "then again slowly over time as the group sched_entity
> starts reflecting that the task has migrated."

The problem is that there is no direct link between a group
sched_entity's se->avg and se->my_q.avg. The latter is the sum of the
PELT load/util of the sched_entities (tasks or nested groups) on the
group cfs_rq, while the former is the PELT load/util of the group entity
itself. It is not based on the cfs_rq sum; it basically just tracks
whether that group entity has been running/runnable or not, weighted by
the group load code which updates the weight occasionally.

In other words, we do go up/down the hierarchy when tasks migrate, but
we only update the se->my_q.avg (cfs_rq), not the se->avg which is the
load of the group seen by the parent cfs_rq. So the immediate update of
the group cfs_rq.avg where the task sched_entity is enqueued/dequeued
doesn't trickle through the hierarchy instantaneously.
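As a toy illustration of why the update is not instantaneous at the root: a
PELT-style signal decays geometrically, with the decay factor chosen so a
contribution halves roughly every 32 periods. This is a numeric sketch only
(the scale value 512 is an assumed example, not kernel code):

```python
# Toy model of PELT-style geometric decay (illustrative, not kernel code):
# the decay factor y satisfies y^32 = 0.5, so a stale contribution halves
# every 32 periods (~32ms).
Y = 0.5 ** (1.0 / 32)

def decay(avg, periods):
    """Decay a PELT-style average over `periods` periods with no new activity."""
    return avg * (Y ** periods)

# A task contributing a steady 512 migrates away at t=0. The group cfs_rq
# sum drops by 512 immediately, but the group se->avg seen by the parent
# cfs_rq only decays period by period: 512 -> 256 after 32 periods,
# -> 128 after 64, and so on.
group_se_avg = 512.0
```

So even after an instantaneous update of the group cfs_rq, the parent's view
of the group entity takes tens of milliseconds to catch up.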

> Another question is: does cfs_rq.avg _ONLY_ reflect historic behavior,
> not present behavior? So even after the task has migrated, do we still
> need to decay it slowly? Or is this different between load and util?

cfs_rq.avg is instantaneously updated on task migration as it is the sum
of the PELT contributions of the sched_entities associated with that
cfs_rq. The group se->avg is not a sum; it behaves just as if it were a
task with a variable load_weight determined by the group weighting code,
but is otherwise identical. There is no adding/removing of contributions
when tasks migrate.
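To make the contrast concrete, here is a minimal sketch (an illustrative
model, not kernel code; class and task names are made up): a cfs_rq whose
avg is a sum of per-entity contributions, updated the instant an entity is
attached or detached, as opposed to a group entity's single decaying number:

```python
# Illustrative contrast (not kernel code): cfs_rq.avg as a sum of
# contributions, updated instantly on attach/detach. A group se->avg, by
# contrast, is one decaying number with nothing added or removed on
# migration (see the decay sketch above).
class ToyCfsRq:
    def __init__(self):
        self.contrib = {}

    def attach(self, name, avg):
        self.contrib[name] = avg      # contribution added on enqueue/migrate-in

    def detach(self, name):
        del self.contrib[name]        # contribution removed instantly on migrate-out

    @property
    def avg(self):
        return sum(self.contrib.values())

rq = ToyCfsRq()
rq.attach("task1", 300)
rq.attach("task2", 212)
rq.detach("task1")                    # migration: rq.avg drops to 212 at once
```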

>
> > I don't think the slow response time for _load_ is necessarily a big
> > problem. Otherwise we would have had people complaining already about
> > group scheduling being broken. It is however a problem for all the
> > initiatives that built on utilization.
>
> Or maybe we need to separate utilization and load; these two signals
> have different semantics and purposes.

I think that is up for discussion. People might have different views on
the semantics of utilization. I see them as very similar in the
non-group scheduling case: one is based on running time and is not
priority weighted, the other is based on runnable time and has priority
weighting. Otherwise they are the same.
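A back-of-the-envelope sketch of that distinction (the scale and weight
values below are assumptions for illustration, not the kernel's internals):
utilization counts only *running* time with no priority weighting, while
load counts *runnable* time (running plus waiting on the rq) scaled by the
entity's weight:

```python
# Hypothetical sketch, not kernel code: assumed full-capacity scale and an
# assumed nice-0 weight, both 1024 for illustration.
SCALE = 1024
NICE0_WEIGHT = 1024

def toy_util(running_ms, window_ms):
    # utilization: running time only, no priority weighting
    return SCALE * running_ms // window_ms

def toy_load(runnable_ms, window_ms, weight=NICE0_WEIGHT):
    # load: runnable time (running + waiting), scaled by the entity weight
    return weight * runnable_ms // window_ms

# A task that runs 25ms and waits another 25ms in a 100ms window:
u = toy_util(25, 100)    # only the 25ms of running counts
l = toy_load(50, 100)    # all 50ms of runnable time counts, weight-scaled
```

A higher-priority (heavier-weight) task would report the same utilization
but a larger load, which is the asymmetry discussed above.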

However, in the group scheduling case, I think they should behave
somewhat differently. Load is priority-scaled and is designed to ensure
fair scheduling when the system is fully utilized, whereas utilization
provides a metric that estimates the actual busy time of the cpus. Group
load is scaled such that it is capped no matter how much actual cpu time
the group gets across the system. I don't think it makes sense to do the
same for utilization, as it would then no longer represent the actual
compute demand. Utilization should be treated as a 'flat hierarchy', as
Yuyang mentions in his reply, so the sum at the root cfs_rq is a proper
estimate of the utilization of the cpu regardless of whether tasks are
grouped or not.
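The 'flat hierarchy' idea can be sketched like this (an assumed model with
made-up task names and utilization values, not kernel code): each task's
utilization is summed directly at the root, ignoring group nesting, so the
root total estimates cpu busy time however the tasks are grouped:

```python
# Sketch of flat-hierarchy utilization (assumed model, not kernel code):
# hypothetical per-task utilization values, keyed by their group path.
task_util = {
    "groupA/task1": 200,
    "groupA/groupB/task2": 150,
    "task3": 100,
}

# Root utilization is the plain sum over all tasks, regardless of nesting,
# unlike group load which is capped per group.
root_util = sum(task_util.values())    # 450 in this example
```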