Re: [PATCH RFC] sched/fair: let cpu's cfs_rq to reflect task migration

From: Morten Rasmussen
Date: Mon Apr 04 2016 - 04:45:57 EST


On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> > I think I follow - Leo please correct me if I mangle your intentions.
> > It's an issue that Morten and Dietmar had mentioned to me as well.

Yes. We have been working on this issue for a while without getting to a
nice solution yet.

> > Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a
> > task group other than the root. The task migrates from one CPU to
> > another. The cfs_rq.avg structures on the src and dest CPUs
> > corresponding to the group containing the task are updated immediately
> > via remove_entity_load_avg()/update_cfs_rq_load_avg() and
> > attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root
> > on the src and dest are not immediately updated. The root cfs_rq.avg
> > must catch up over time with PELT.

Yes. The problem is that it is only the cfs_rq.avg of the cfs_rq where
the task is enqueued/dequeued that gets immediately updated. If that
cfs_rq is a group cfs_rq, its group entity's se.avg doesn't get updated
immediately. It has to adapt over time at the pace defined by the
geometric series.
The impact of a task migration therefore doesn't trickle through to the
root cfs_rq.avg. This behaviour is one of the fundamental changes Yuyang
introduced with his rewrite of PELT.

> > As to why we care, there's at least one issue which may or may not be
> > Leo's - the root cfs_rq is the one whose avg structure we read from to
> > determine the frequency with schedutil. If you have a cpu-bound task in
> > a non-root cgroup which periodically migrates among CPUs on an otherwise
> > idle system, I believe each time it migrates the frequency would drop to
> > fmin and have to ramp up again with PELT.

It makes any scheduling decision based on utilization difficult if fair
group scheduling is used, as cpu_util() doesn't give an up-to-date
picture of the utilization caused by tasks in task groups.

For the energy-aware scheduling patches, and the patches we have in the
pipeline for improving capacity awareness in the scheduler, we rely on
cpu_util().

> Steve, thanks for explanation and totally agree. My initial purpose is
> not from schedutil's perspective, but just did some basic analysis for
> utilization. So my case is: fixed cpu to maximum frequency 1.2GHz, then
> launch a periodic task (period: 100ms, duty cycle: 40%) and limit its
> affinity to only two CPUs. So observed the same result like you said.
>
> After applied this patch, I can get much better result for the CPU's
> utilization after task's migration. Please see the parsed result for
> CPU's utilization:
> http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.png
> http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg.png
>
> > Leo I noticed you did not modify detach_entity_load_average(). I think
> > this would be needed to avoid the task's stats being double counted for
> > a while after switched_from_fair() or task_move_group_fair().

I'm afraid that the solution to this problem is more complicated than
that :-(

You are adding/removing a contribution from the root cfs_rq.avg which
isn't part of the signal in the first place. The root cfs_rq.avg only
contains the sum of the load/util of the sched_entities on the cfs_rq.
If you remove the contribution of the tasks from there you may end up
double-accounting for the task migration. Once due to your patch and
then again slowly over time as the group sched_entity starts reflecting
that the task has migrated. Furthermore, for group scheduling to make
sense it has to be the task_h_load() you add/remove, otherwise the
group weighting is completely lost. Or am I completely misreading your
patch?

I don't think the slow response time for _load_ is necessarily a big
problem. Otherwise we would have had people complaining already about
group scheduling being broken. It is however a problem for all the
initiatives that build on utilization.

We have to either make the updates trickle through the group hierarchy
for utilization, which is difficult without making a lot of changes to
the current code structure, or introduce a new avg structure at root
level which contains the sum of task utilization for all _tasks_ (not
groups) in the group hierarchy and maintain it separately.

None of those two are particularly pretty. Better suggestions?