Re: [PATCH RFC] sched/fair: let cpu's cfs_rq to reflect task migration

From: Morten Rasmussen
Date: Mon Apr 04 2016 - 04:58:38 EST


On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fe30e6..10ca1a9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2825,12 +2825,24 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
>  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
>  	struct sched_avg *sa = &cfs_rq->avg;
> +	struct sched_avg *cpu_sa = NULL;
>  	int decayed, removed = 0;
> +	int cpu = cpu_of(rq_of(cfs_rq));
> +
> +	if (&cpu_rq(cpu)->cfs != cfs_rq)
> +		cpu_sa = &cpu_rq(cpu)->cfs.avg;
> 
>  	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
>  		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
>  		sa->load_avg = max_t(long, sa->load_avg - r, 0);
>  		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> +
> +		if (cpu_sa) {
> +			cpu_sa->load_avg = max_t(long, cpu_sa->load_avg - r, 0);
> +			cpu_sa->load_sum = max_t(s64,
> +				cpu_sa->load_sum - r * LOAD_AVG_MAX, 0);

I think this is not quite right. 'r' is the load contribution removed
from a group cfs_rq. Unless the task is the only task in the group and
the group shares are default, this value is different from the load
contribution of the task at root level (task_h_load()).
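
For reference, task_h_load() scales the task's own load_avg by the
group hierarchy's share of the root, roughly like this (mainline code
of that era, shown here only to illustrate the scaling):

static unsigned long task_h_load(struct task_struct *p)
{
	struct cfs_rq *cfs_rq = task_cfs_rq(p);

	update_cfs_rq_h_load(cfs_rq);
	return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
			cfs_rq_load_avg(cfs_rq) + 1);
}

So the raw group-level 'r' only matches the root-level contribution
when cfs_rq->h_load happens to equal the group's own load_avg.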

And how about nested groups? I think you may end up removing the
contribution of a task multiple times from the root cfs_rq if it is in a
nested group.

You will need to either redesign group scheduling so you get the load to
trickle down automatically and with the right contribution, or maintain
a new sum at the root bypassing the existing design. I'm not sure if the
latter makes sense for load though. It would make more sense for
utilization only.
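
If the extra sum at the root was restricted to utilization, the
removal path would at least not need any group scaling, since
utilization is not weighted by group shares. A rough sketch of what
I mean (untested, just mirroring the existing removed_util_avg
handling):

	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
		sa->util_avg = max_t(long, sa->util_avg - r, 0);
		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);

		/* hypothetical: mirror the removal at the root, util only */
		if (cpu_sa) {
			cpu_sa->util_avg = max_t(long,
					cpu_sa->util_avg - r, 0);
			cpu_sa->util_sum = max_t(s32,
					cpu_sa->util_sum - r * LOAD_AVG_MAX, 0);
		}
	}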

> @@ -2919,6 +2939,13 @@ skip_aging:
>  	cfs_rq->avg.load_sum += se->avg.load_sum;
>  	cfs_rq->avg.util_avg += se->avg.util_avg;
>  	cfs_rq->avg.util_sum += se->avg.util_sum;
> +
> +	if (&cpu_rq(cpu)->cfs != cfs_rq) {
> +		cpu_rq(cpu)->cfs.avg.load_avg += se->avg.load_avg;
> +		cpu_rq(cpu)->cfs.avg.load_sum += se->avg.load_sum;
> +		cpu_rq(cpu)->cfs.avg.util_avg += se->avg.util_avg;
> +		cpu_rq(cpu)->cfs.avg.util_sum += se->avg.util_sum;
> +	}

Same problem as above. You should be adding the right amount of task
contribution here, something similar to task_h_load(). The problem with
using task_h_load() is that it is not updated immediately either, so
maintaining a stable signal based on that might be difficult.
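
To illustrate, something along these lines could add a group-scaled
contribution at attach time (purely a sketch, reusing cfs_rq->h_load
the way task_h_load() does; update_cfs_rq_h_load() would need a
forward declaration here):

	if (&cpu_rq(cpu)->cfs != cfs_rq) {
		unsigned long h_load;

		/* scale the entity load by the group hierarchy */
		update_cfs_rq_h_load(cfs_rq);
		h_load = div64_ul(se->avg.load_avg * cfs_rq->h_load,
				  cfs_rq_load_avg(cfs_rq) + 1);

		cpu_rq(cpu)->cfs.avg.load_avg += h_load;
		cpu_rq(cpu)->cfs.avg.load_sum += (u64)h_load * LOAD_AVG_MAX;

		/* utilization is not weighted by shares, add it as-is */
		cpu_rq(cpu)->cfs.avg.util_avg += se->avg.util_avg;
		cpu_rq(cpu)->cfs.avg.util_sum += se->avg.util_sum;
	}

But as said, h_load is only refreshed lazily, so the amount added here
and the amount removed later on migration would not necessarily match.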