Re: [RFC v3 0/5] Add capacity capping support to the CPU controller

From: Peter Zijlstra
Date: Wed Apr 12 2017 - 08:13:26 EST


Let me reply in parts as I read this.. easy things first :-)

On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> On 10-Apr 09:36, Peter Zijlstra wrote:

> > 4) they have muddled semantics, because while it's presented as a task
> > property, it very much is not.
>
> Actually we always presented it as a task group property, while other
> people suggested it should be a per-task API.
>
> Still, IMO it could also make sense as a per-task API, for example
> for a specific RT task which we know is perfectly fine to always run
> below a certain OPP in our system.
>
> Do you think it should be a per-task API?
> Should we focus (at least initially) on providing a per task-group API?

Even for the cgroup interface, I think they should set a per-task
property, not a group property.

> > 3) they have absolutely unacceptable overhead in implementation. Two
> > more RB tree operations per enqueue/dequeue is just not going to
> > happen.
>
> This last point is about "implementation details"; I'm pretty
> confident that if we find an agreement on the previous point then
> this last one will be simple to solve.
>
> Just to be clear, the rb-trees are per CPU and used to track just the
> RUNNABLE tasks on each CPU and, as I described in the previous
> example, for the OPP biasing to work I think we really need an
> aggregation mechanism.

I know it's runnable, which is exactly what the regular RB tree in fair
already tracks. That gets us 3 RB tree operations per scheduling
operation, which is entirely ridiculous.

And while I disagree with the need to track this at all (see below), it
can be done _much_ cheaper using a double augmented RB-tree, where you
keep the min/max as heaps inside the regular RB-tree.

> Ideally, every time a task is enqueued/dequeued from a CPU we want to
> know the currently required capacity clamping. This requires
> maintaining an ordered list of values, and rb-trees are the most
> effective solution.
>
> Perhaps, if we accept a certain level of approximation, we can
> potentially reduce the set of tracked values to a finite set (maybe
> corresponding to the EM capacity values) and use just a per-cpu
> ref-counting array.
> Still, the min/max aggregation will have a worst-case O(N) search
> complexity, where N is the number of finite values we want to use.

So the bigger point is that if the min/max is a per-task property (even
if set through a cgroup interface), the min(max) / max(min) thing is
wrong.

If the min/max were to apply to each individual task's util, you'd end
up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)].

Where a possible approximation is scaling the aggregate util into that
domain.
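
To make that concrete, a toy calculation (util and the per-task min/max
expressed as fractions of CPU capacity; the names are invented for the
example, this is not scheduler code): the summed util is simply clamped
into [min(1, \Sum min), min(1, \Sum max)]:

#include <stdio.h>

struct task_clamp {
	double util;	/* task utilisation, 0..1  */
	double min;	/* per-task minimum clamp  */
	double max;	/* per-task maximum clamp  */
};

static double min_d(double a, double b) { return a < b ? a : b; }

static double aggregate_util(const struct task_clamp *t, int n)
{
	double sum_util = 0, sum_min = 0, sum_max = 0;
	double lo, hi;
	int i;

	for (i = 0; i < n; i++) {
		sum_util += t[i].util;
		sum_min  += t[i].min;
		sum_max  += t[i].max;
	}

	lo = min_d(1.0, sum_min);
	hi = min_d(1.0, sum_max);

	/* scale (here: clamp) the aggregate into [lo, hi] */
	if (sum_util < lo)
		return lo;
	if (sum_util > hi)
		return hi;
	return sum_util;
}

int main(void)
{
	struct task_clamp tasks[] = {
		{ .util = 0.10, .min = 0.20, .max = 0.50 },
		{ .util = 0.30, .min = 0.10, .max = 0.40 },
	};

	/* sum = 0.40, domain = [0.30, 0.90] -> stays 0.40 */
	printf("effective util: %.2f\n", aggregate_util(tasks, 2));
	return 0;
}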