Re: [PATCH v2] sched/fair: update scale invariance of PELT

From: Vincent Guittot
Date: Wed Apr 12 2017 - 10:50:59 EST


Le Wednesday 12 Apr 2017 à 13:28:58 (+0200), Peter Zijlstra a écrit :
> On Tue, Apr 11, 2017 at 03:09:20PM +0200, Vincent Guittot wrote:
> > Le Tuesday 11 Apr 2017 à 12:49:49 (+0200), Peter Zijlstra a écrit :
> > >
> > > Lets go back to the unscaled version:
> > >
> > > running idle
> > > |*********|---------|
> > >
> > > With the current code, that would effectively end up like (again
> > > assuming 50%):
> > >
> > > running idle
> > > |*_*_*_*_*|---------|
> > >
> > >
> > > Time stays the same, but we add extra idle cycles.
> >
> > In fact it's not really like that because this doesn't reflect the impact of
> > the decay window which is not linear with time
> >
> > For a task that run at max freq
> >
> > (1) |-------|-------|-------|----
> > **** ****
> > ---****------------****------
> >
> > The same task running at half frequency, will run twice longer
> > |-------|-------|-------|----
> > ******** ********
> > ---********--------********--
> >
> > With the current implementation, we are not inserting idle cycle but
> > dividing the contribution by half in order to compensate the fact that the task
> > will run twice longer:
> >
> > (2) |-------|-------|-------|----
> >
> > ---********--------********--
> >
> > But the final result is neither equal to
> >
> > |-------|-------|-------|----
> > * * * * * * * *
> > ---*_*_*_*---------*_*_*_*---
> >
> > nor to (1) as described below:
>
> I'm not sure I get what you say; dividing the contribution in half is
> exactly equal to the above picture. Because as you can see, there's half
> the number of * per window.

yes you're right, the above is equal to (2) but not to (1)
I mess up the _ which some additional decays

>
> > Let assume all durations are aligned with decay window. This will simplify the
> > formula for the explanation and will not change the demonstration. We can come back
> > on the full equation later on.
> >
> > For (1), we have the equation that you write previously:
> >
> > util_sum' = utilsum * y^p +
> > p-1
> > d1 * y^p + 1024 * \Sum y^n + d3 * y^0
> > n=1
> >
> > Which becomes like below when we are aligned to decay window (d1 and d3 are null)
>
> > (A) util_sum' = utilsum * y^p +
> > p-1
> > 1024 * \Sum y^n
> > n=1
> >
> > The running time at max frequency is d and p is the number of decay window period
> > In this case p = d/1024us and l = 0
> >
> > For (2), task runs at half frequency so the duration d' is twice longer than
> > d and p'=2*p. In current implementation, we compensate the increase of running
> > duration (and lost idle time) by dividing the contribution by 2:
>
> > util_sum' = utilsum * y^p' +
> > p'-1
> > 512 * \Sum y^n
> > n=1
> >
> > (B) util_sum' = utilsum * y^(2*p) +
> > 2*p-1
> > 512 * \Sum y^n
> > n=1
>
> Hmmm.. but 2p is the actual wall-time of the window period.
>
> > With the new scale invariance, we have the following pattern before scaling
> > time:
> >
> > |-------|-------|-------|--
> > ******** ********
> > ---********--------********
> >
> > Task still runs at half frequency so the duration d' is twice longer than
> > d and p': p'=2*p just like for (2).
>
> > But before computing util_sum', we change the temporal referencial to reflect
> > what would have been done at max frequency: we scale the running time (divide it by 2)
> > in order to have something like:
> >
> > |-------|-------|-------|--
> > **** ****
> > ---****____--------****____
>
> So you're moving the window edges away from wall-time.
>
> > so we now have a duration d'' that is half d'and a number of period p'' that
> > is half p' so p'' = 1/2 * p' == p
>
> > util_sum' = utilsum * y^p'' +
> > p''-1
> > 1024 * \Sum y^n
> > n=1
> > util_sum' = utilsum * y^p +
> > p-1
> > 512 * \Sum y^n
> > n=1
>
>
> |---------|---------| (wall-time)
> ----****------------- F=100%
> ----******----------- F= 66%
> |--------------|----| (fudge-time)

It has been a bit hard for me to catch the diagram above because you scale the
idle time to get same ratio at 100% and 66% wherease I don't scale idle
time but only running time.
Then, we are not reducing the next window but delaying it
so it's something like below

|---------|---------| (wall-time)
----****------------- F=100%
----******----------- F= 66%
|--------------|---------| (fudge-time)

>
> (explicitly not used 50%, because then the second window would have
> collapsed to 0, imagine the joy if you go lower still)

The second window can't collapse because we are working on delta time not
absolute wall-time and the delta is for only 1 type at a time: running or idle

>
>
> So in fudge-time the first window has 6/15 == 4/10 for the max-freq /
> wall-time combo.
>
> >
> > Then l = p' - p''. The lost idle time is tracked to apply the same amount of decay
> > window when the task is sleeping
> >
> > so at the end we have a number of decay window of p''+l = p'' so we still have
> > the same amount of decay window than previously.
>
> Now, we have to stretch time back to equal window size, and while you do
> that for the active windows, we have to do manual compensation for idle
> windows (which is somewhat ugleh) and is where the lost-time comes from.

We can't stretch idle time because there is no relation between the idle time
and the current capacity.

>
> Also, this all feels entirely yucky, because as per the above, if we'd
> ran at 33%, we'd have ended up with a negative time window.

Not sure to catch how we can end up with negative window. We are working with
delta time not absolute time.

>
>
> Not to mention that this only seems to work for low utilization. Once
> you hit higher utilization scenarios, where there isn't much idle time
> to compensate for the stretching, things go wobbly. Although both
> scenarios might end up being the same.

During the running phase, we calculate how much idle time has diseappeared
because we are running at lower frequency and we compensate it once back to
idle.

>
> And instead of resurrecting 0 sized windows, you throw them out, which

I don't catch point above

> results in max-util at low F being reported as max-util at F=1, which I
> suppose is a useful property and results in the increased ramp-up (which
> is a desired property).
>
> So not only do you have non-linear time, you also have non-continuous
> time.
>
> I still wonder about the whole !running vs !weight thing.. this all
> needs more thinking.
>