Re: [PATCH RFC] sched/fair: let cpu's cfs_rq to reflect task migration

From: Leo Yan
Date: Sat Apr 02 2016 - 03:12:36 EST


On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> On 04/01/2016 12:49 PM, Peter Zijlstra wrote:
> > On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote:
> >> When task is migrated from CPU_A to CPU_B, scheduler will decrease
> >> the task's load/util from the task's cfs_rq and also add them into
> >> migrated cfs_rq. But if kernel enables CONFIG_FAIR_GROUP_SCHED then this
> >> cfs_rq is not the same one with cpu's cfs_rq. As a result, after task is
> >> migrated to CPU_B, then CPU_A still have task's stale value for
> >> load/util; on the other hand CPU_B also cannot reflect new load/util
> >> which introduced by the task.
> >>
> >> So this patch is to operate the task's load/util to cpu's cfs_rq, so
> >> finally cpu's cfs_rq can really reflect task's migration.
> >
> > Sorry but that is unintelligible.
> >
> > What is the problem? Why do we care? How did you fix it? and at what
> > cost?

Sorry. I should have taken the time to write a better quality commit message.

> I think I follow - Leo please correct me if I mangle your intentions.
> It's an issue that Morten and Dietmar had mentioned to me as well.
>
> Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a
> task group other than the root. The task migrates from one CPU to
> another. The cfs_rq.avg structures on the src and dest CPUs
> corresponding to the group containing the task are updated immediately
> via remove_entity_load_avg()/update_cfs_rq_load_avg() and
> attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root
> on the src and dest are not immediately updated. The root cfs_rq.avg
> must catch up over time with PELT.
>
> As to why we care, there's at least one issue which may or may not be
> Leo's - the root cfs_rq is the one whose avg structure we read from to
> determine the frequency with schedutil. If you have a cpu-bound task in
> a non-root cgroup which periodically migrates among CPUs on an otherwise
> idle system, I believe each time it migrates the frequency would drop to
> fmin and have to ramp up again with PELT.

Steve, thanks for the explanation; I totally agree. My initial motivation was
not from schedutil's perspective, I was just doing some basic analysis of the
utilization signals. My case was: fix the CPU at its maximum frequency of
1.2GHz, then launch a periodic task (period: 100ms, duty cycle: 40%) and limit
its affinity to only two CPUs. I observed the same result you describe; a
rough reproduction of the test case is sketched below.
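This is only an illustrative user-space sketch of the workload described above
(the choice of CPU1/CPU2 and the exact busy/sleep split are assumptions, not
the exact test program I used):

	/*
	 * Periodic task: 100ms period, ~40% duty cycle, pinned to two CPUs
	 * so the scheduler can migrate it between them.
	 */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <time.h>

	static void busy_wait_ms(long ms)
	{
		struct timespec start, now;

		clock_gettime(CLOCK_MONOTONIC, &start);
		do {
			clock_gettime(CLOCK_MONOTONIC, &now);
		} while ((now.tv_sec - start.tv_sec) * 1000L +
			 (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
	}

	int main(void)
	{
		cpu_set_t mask;
		struct timespec sleep_ts = { 0, 60 * 1000000L };	/* 60ms idle */

		CPU_ZERO(&mask);
		CPU_SET(1, &mask);	/* restrict affinity to CPU1 and CPU2 */
		CPU_SET(2, &mask);
		if (sched_setaffinity(0, sizeof(mask), &mask)) {
			perror("sched_setaffinity");
			return 1;
		}

		for (;;) {
			busy_wait_ms(40);	/* 40ms busy */
			clock_nanosleep(CLOCK_MONOTONIC, 0, &sleep_ts, NULL);
		}
		return 0;
	}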

After applying this patch, I get a much better result for the CPU's
utilization after the task's migration. Please see the parsed results for
CPU utilization:
http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.png
http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg.png
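To make the idea concrete, the change is essentially to mirror the migrating
task's load/util into the root (per-CPU) cfs_rq as well, so it does not have
to ramp up again via PELT. A minimal sketch of the attach side is below; the
helper name and exact call site are illustrative only, field names follow the
current PELT code:

	/*
	 * Sketch only, not the exact patch: when a task entity is attached
	 * on the destination CPU, also add its load/util into the root
	 * cfs_rq (rq->cfs) so the per-CPU signal reflects the migration
	 * immediately.
	 */
	static void attach_task_to_root_cfs_rq(struct sched_entity *se)
	{
		struct cfs_rq *cfs_rq = cfs_rq_of(se);
		struct cfs_rq *root_cfs_rq = &rq_of(cfs_rq)->cfs;

		/* Only task entities; nothing to do if already on the root */
		if (!entity_is_task(se) || cfs_rq == root_cfs_rq)
			return;

		root_cfs_rq->avg.load_avg += se->avg.load_avg;
		root_cfs_rq->avg.load_sum += se->avg.load_sum;
		root_cfs_rq->avg.util_avg += se->avg.util_avg;
		root_cfs_rq->avg.util_sum += se->avg.util_sum;
	}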

> Leo I noticed you did not modify detach_entity_load_average(). I think
> this would be needed to avoid the task's stats being double counted for
> a while after switched_from_fair() or task_move_group_fair().

Will add it.
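
For completeness, the detach side would be the mirror image with clamping at
zero, following the same pattern as detach_entity_load_avg(); again this is
only a sketch with an illustrative helper name:

	/* Sketch of the detach counterpart: subtract with clamping at zero. */
	static void detach_task_from_root_cfs_rq(struct sched_entity *se)
	{
		struct cfs_rq *cfs_rq = cfs_rq_of(se);
		struct cfs_rq *root_cfs_rq = &rq_of(cfs_rq)->cfs;

		if (!entity_is_task(se) || cfs_rq == root_cfs_rq)
			return;

		root_cfs_rq->avg.load_avg =
			max_t(long, root_cfs_rq->avg.load_avg - se->avg.load_avg, 0);
		root_cfs_rq->avg.load_sum =
			max_t(s64, root_cfs_rq->avg.load_sum - se->avg.load_sum, 0);
		root_cfs_rq->avg.util_avg =
			max_t(long, root_cfs_rq->avg.util_avg - se->avg.util_avg, 0);
		root_cfs_rq->avg.util_sum =
			max_t(s32, root_cfs_rq->avg.util_sum - se->avg.util_sum, 0);
	}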

Thanks,
Leo Yan