Re: [PATCH RFC] sched/fair: let cpu's cfs_rq to reflect task migration

From: Steve Muckle
Date: Fri Apr 01 2016 - 18:28:55 EST


On 04/01/2016 12:49 PM, Peter Zijlstra wrote:
> On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote:
>> When task is migrated from CPU_A to CPU_B, scheduler will decrease
>> the task's load/util from the task's cfs_rq and also add them into
>> migrated cfs_rq. But if kernel enables CONFIG_FAIR_GROUP_SCHED then this
>> cfs_rq is not the same one with cpu's cfs_rq. As a result, after task is
>> migrated to CPU_B, then CPU_A still have task's stale value for
>> load/util; on the other hand CPU_B also cannot reflect new load/util
>> which introduced by the task.
>>
>> So this patch is to operate the task's load/util to cpu's cfs_rq, so
>> finally cpu's cfs_rq can really reflect task's migration.
>
> Sorry but that is unintelligible.
>
> What is the problem? Why do we care? How did you fix it? and at what
> cost?

I think I follow - Leo please correct me if I mangle your intentions.
It's an issue that Morten and Dietmar had mentioned to me as well.

Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a
task group other than the root. The task migrates from one CPU to
another. The cfs_rq.avg structures on the src and dest CPUs
corresponding to the group containing the task are updated immediately
via remove_entity_load_avg()/update_cfs_rq_load_avg() and
attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root
on the src and dest are not immediately updated. The root cfs_rq.avg
must catch up over time with PELT.

As to why we care, there's at least one issue which may or may not be
Leo's - the root cfs_rq is the one whose avg structure we read from to
determine the frequency with schedutil. If you have a cpu-bound task in
a non-root cgroup which periodically migrates among CPUs on an otherwise
idle system, I believe each time it migrates the frequency would drop to
fmin and have to ramp up again with PELT.

Leo, I noticed you did not modify detach_entity_load_avg(). I think
this would be needed to avoid the task's stats being double counted for
a while after switched_from_fair() or task_move_group_fair().