Re: [PATCH v4 2/2] sched/fair: update scale invariance of PELT

From: Vincent Guittot
Date: Wed Oct 24 2018 - 05:07:40 EST


Hi Pavan,

On Wed, 24 Oct 2018 at 06:53, Pavan Kondeti <pkondeti@xxxxxxxxxxxxxx> wrote:
>
> Hi Vincent,
>
> Thanks for the detailed explanation.
>
> On Tue, Oct 23, 2018 at 02:15:08PM +0200, Vincent Guittot wrote:
> > Hi Pavan,
> >
> > On Tue, 23 Oct 2018 at 07:59, Pavan Kondeti <pkondeti@xxxxxxxxxxxxxx> wrote:
> > >
> > > Hi Vincent,
> > >
> > > On Fri, Oct 19, 2018 at 06:17:51PM +0200, Vincent Guittot wrote:
> > > >
> > > > /*
> > > > + * The clock_pelt scales the time to reflect the effective amount of
> > > > + * computation done during the running delta time but then sync back to
> > > > + * clock_task when rq is idle.
> > > > + *
> > > > + *
> > > > + * absolute time | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
> > > > + * @ max capacity ------******---------------******---------------
> > > > + * @ half capacity ------************---------************---------
> > > > + * clock pelt | 1| 2| 3| 4| 7| 8| 9| 10| 11|14|15|16
> > > > + *
> > > > + */
> > > > +void update_rq_clock_pelt(struct rq *rq, s64 delta)
> > > > +{
> > > > +
> > > > + if (is_idle_task(rq->curr)) {
> > > > + u32 divider = (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
> > > > + u32 overload = rq->cfs.avg.util_sum + LOAD_AVG_MAX;
> > > > + overload += rq->avg_rt.util_sum;
> > > > + overload += rq->avg_dl.util_sum;
> > > > +
> > > > + /*
> > > > + * Reflecting some stolen time makes sense only if the idle
> > > > + * phase would be present at max capacity. As soon as the
> > > > + * utilization of a rq has reached the maximum value, it is
> > > > + * considered as an always runnnig rq without idle time to
> > > > + * steal. This potential idle time is considered as lost in
> > > > + * this case. We keep track of this lost idle time compare to
> > > > + * rq's clock_task.
> > > > + */
> > > > + if (overload >= divider)
> > > > + rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> > > > +
> > >
> > > I am trying to understand this better. I believe we run into this scenario, when
> > > the frequency is limited due to thermal/userspace constraints. Lets say
> >
> > Yes these are the most common UCs but this can also happen after tasks
> > migration or with a cpufreq governor that doesn't increase OPP fast
> > enough for current utilization.
> >
> > > frequency is limited to Fmax/2. A 50% task at Fmax, becomes 100% running at
> > > Fmax/2. The utilization is built up to 100% after several periods.
> > > The clock_pelt runs at 1/2 speed of the clock_task. We are loosing the idle time
> > > all along. What happens when the CPU enters idle for a short duration and comes
> > > back to run this 100% utilization task?
> >
> > If you are at 100%, we only apply the short idle duration
> >
> > >
> > > If the above block is not present i.e lost_idle_time is not tracked, we
> > > stretch the idle time (since clock_pelt is synced to clock_task) and the
> > > utilization is dropped. Right?
> >
> > yes that 's what would happen. I gives more details below
> >
> > >
> > > With the above block, we don't stretch the idle time. In fact we don't
> > > consider the idle time at all. Because,
> > >
> > > idle_time = now - last_time;
> > >
> > > idle_time = (rq->clock_pelt - rq->lost_idle_time) - last_time
> > > idle_time = (rq->clock_task - rq_clock_task + rq->clock_pelt_old) - last_time
> > > idle_time = rq->clock_pelt_old - last_time
> > >
> > > The last time is nothing but the last snapshot of the rq->clock_pelt when the
> > > task entered sleep due to which CPU entered idle.
> >
> > The condition for dropping this idle time is quite important. This
> > only happens when the utilization reaches max compute capacity of the
> > CPU. Otherwise, the idle time will be fully applied
>
> Right.
>
> rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt
>
> This not only tracks the lost idle time due to running slow but also the
> absolute/real sleep time. For example, when the slow running 100% task
> sleeps for 100 msec, are not we ignoring the 100 msec sleep there?
>
> For example a task ran 323 msec at full capacity and sleeps for (1000-323)
> msec. when it wakes up the utilization is dropped. If the same task runs
> for 626 msec at the half capacity and sleeps for (1000-626), should not
> drop the utilization by taking (1000-626) sleep time into account. I
> understand that why we don't strech idle time to (1000-323) but it is not
> clear to me why we completely drop the idle time.

So this should not happen.
I' m going to update the way I track lost idle time and move it out
of update_rq_clock_pelt() and only do the test when entering idle
This is even better as it simplifies update_rq_clock_pelt() and
reduces the number of tests for lost idle time

Thanks for spotting this

I'm preparing a new version with this, some build fix for !SMP and the
alignement with cache line suggested by Peter

Vincent
>
> >
> > >