Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy

From: Morten Rasmussen
Date: Thu Sep 01 2016 - 10:22:16 EST


On Tue, Aug 30, 2016 at 03:00:58AM +0800, Yuyang Du wrote:
> On Wed, Aug 24, 2016 at 09:54:35AM +0100, Morten Rasmussen wrote:
> > As Dietmar mentioned already, the 'disconnect' is a feature of the PELT
> > rewrite. Paul and Ben's original implementation had full propagation up
> > and down the hierarchy. IIRC, one of the key points of the rewrite was
> > more 'stable' signals, which we would lose by re-introducing immediate
> > updates throughout the hierarchy.
>
> As I mentioned earlier, no essential change!

I don't quite agree with that; it is a very significant change, and you
describe the problem yourself further down.

> Perhaps one feature is that the rewrite takes the runnable ratio into
> account.
>
> E.g., let there be a group containing one task with share 1024; the task
> sticks to one CPU and is runnable 50% of the time.
>
> With the old implementation, the group_entity_load_avg is 1024; but with
> the rewritten implementation, the group_entity_load_avg is 512. Isn't this
> good?
>
> If the task migrates, the old implementation will still be 1024 on the new
> CPU, but the rewritten implementation will transition to 512, albeit taking
> 0.1+ seconds, which we are now addressing. Isn't this good?

No, this is exactly the problem. After the rewrite you no longer see the
effect of migrating a task immediately. Multiple cpus may do load-balancing
in the meantime, before the group load/utilization has settled after the
migration, and they will see a wrong picture of the load/utilization. The
cpu the task migrated away from appears to still have load/utilization,
while the cpu it migrated to appears nearly idle even though it is busy
running the migrated task. That may lead other cpus to put even more tasks
on the cpu the task migrated to.
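
To put a rough number on how long that wrong picture persists, here is a
small user-space sketch (not kernel code) of a PELT-style geometric average
with the usual 1ms period and 32ms half-life. It treats the migrated task's
contribution as a constant 512 (weight 1024 at 50% runnable, as in the
example above) instead of modelling the on/off pattern, so it is only an
approximation, but it shows the destination cpu's group signal needing on
the order of 100-200ms to approach its settled value:

#include <math.h>
#include <stdio.h>

int main(void)
{
        /* per-1ms-period decay factor, chosen so that y^32 == 0.5 */
        const double y = pow(0.5, 1.0 / 32.0);
        double avg = 0.0;               /* dest cpu sees no group load yet */
        const double settled = 512.0;   /* 1024 weight * 50% runnable */
        int ms;

        for (ms = 1; ms <= 192; ms++) {
                /* each period the old sum decays, the new contribution adds */
                avg = avg * y + settled * (1.0 - y);
                if (ms % 32 == 0)
                        printf("%3d ms: %.0f\n", ms, avg);
        }
        return 0;
}

Until the signal has settled, every cpu doing load-balancing is working with
numbers that are off by a large fraction of a whole task.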

We need the cpu load/utilization to be updated immediately to make
meaningful load-balancing decisions. Before the rewrite that was ensured by
updating the entire group hierarchy whenever a task was added or removed.
The task entity's contribution rippled all the way down to the root cfs_rq
load/utilization immediately, since each group cfs_rq's load/utilization
directly fed into its group entity's load/utilization. They were sort of
'connected'.
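
Schematically (hypothetical types and helper names, not the actual
pre-rewrite kernel functions, and ignoring the reweighting by group shares
that the real code does), that behaviour amounts to applying a task's
contribution at every level of the hierarchy on attach/detach, so the root
sees the change at once:

struct cfs_rq_sketch {
        long load_avg;
        struct cfs_rq_sketch *parent;   /* NULL at the root cfs_rq */
};

/* on enqueue/attach: add the task's contribution at every level */
static void attach_task_contrib(struct cfs_rq_sketch *cfs_rq, long contrib)
{
        for (; cfs_rq; cfs_rq = cfs_rq->parent)
                cfs_rq->load_avg += contrib;
}

/* on dequeue/migration: remove it again, so no stale load is left behind */
static void detach_task_contrib(struct cfs_rq_sketch *cfs_rq, long contrib)
{
        for (; cfs_rq; cfs_rq = cfs_rq->parent)
                cfs_rq->load_avg -= contrib;
}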

That is no longer the case, and that is why we have problems with
getting an accurate estimate of cpu utilization when we have task
groups.

>
> > It is a significant change to group scheduling, so I'm a bit surprised
> > that nobody has observed any problems post the rewrite. But maybe most
> > users don't care about the load-balance being slightly off when tasks
> > have migrated or new tasks are added to a group.
>
> I don't understand what you are saying.

See above. Maybe nobody has noticed the difference if they primarily
have use-cases with many long-running tasks.

>
> > If we want to re-introduce propagation of both load and utilization I
> > would suggest that we just look at the original implementation. It
> > seemed to work.
> >
> > Handling utilization and load differently will inevitably result in more
> > code. The 'flat hierarchy' approach seems slightly less complicated, but
> > it prevents us from using group utilization later should we wish to do
> > so. It might for example become useful for the schedutil cpufreq
> > governor should it ever consider selecting frequencies differently based
> > on whether the current task is in a (specific) group or not.
>
> I understand group util may have some use should you attempt to do so, but
> I'm not sure how realistic that is.

I'm not sure whether it will be useful either; so far it is just an idea.

> Nothing prevents you from knowing which (specific) group the current task
> belongs to.

True, but it might be useful to know the utilization of the entire group.
However, in that case I guess the system-wide group utilization might be
more useful than the group utilization on a particular cpu. But I haven't
thought it through.

Morten