Re: [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares()

From: Peter Zijlstra
Date: Fri Sep 29 2017 - 07:35:29 EST


On Thu, Sep 28, 2017 at 11:03:03AM +0100, Morten Rasmussen wrote:

> > +/*
> > + * All this does is approximate the hierarchical proportion which includes that
> > + * global sum we all love to hate.
> > + *
> > + * That is, the weight of a group entity, is the proportional share of the
> > + * group weight based on the group runqueue weights. That is:
> > + *
> > + *                     tg->weight * grq->load.weight
> > + *   ge->load.weight = -----------------------------               (1)
> > + *                        \Sum grq->load.weight
> > + *
> > + * Now, because computing that sum is prohibitively expensive (been there,
> > + * done that) we approximate it with this average stuff. The average
> > + * moves slower and therefore the approximation is cheaper and more stable.
> > + *
> > + * So instead of the above, we substitute:
> > + *
> > + *   grq->load.weight -> grq->avg.load_avg                        (2)
> > + *
> > + * which yields the following:
> > + *
> > + *                     tg->weight * grq->avg.load_avg
> > + *   ge->load.weight = ------------------------------              (3)
> > + *                              tg->load_avg
> > + *
> > + * Where: tg->load_avg ~= \Sum grq->avg.load_avg
> > + *
> > + * That is shares_avg, and it is right (given the approximation (2)).
> > + *
> > + * The problem with it is that because the average is slow -- it was designed
> > + * to be exactly that of course -- this leads to transients in boundary
> > + * conditions. In specific, the case where the group was idle and we start the
> > + * one task. It takes time for our CPU's grq->avg.load_avg to build up,
> > + * yielding bad latency etc..
> > + *
> > + * Now, in that special case (1) reduces to:
> > + *
> > + *                     tg->weight * grq->load.weight
> > + *   ge->load.weight = ----------------------------- = tg>weight   (4)
> > + *                         grp->load.weight
>
> Should it be "grq->load.weight" in the denominator of (4)?
> And "tg->weight" at the end?

Yes, otherwise it all doesn't really make sense :-) Typing is hard, it
seems.
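
FWIW, with those typos fixed, here is a toy userspace sketch (not kernel
code; the names and values are made up to mirror the comment) of the
exact form (1) next to the shares_avg form (3):

#include <stdio.h>

/* Exact hierarchical proportion, eq. (1): needs the global sum. */
static long ge_weight_exact(long tg_weight, const long *grq_weight,
			    int nr_cpus, int cpu)
{
	long sum = 0;

	for (int i = 0; i < nr_cpus; i++)
		sum += grq_weight[i];

	return tg_weight * grq_weight[cpu] / sum;
}

/* shares_avg, eq. (3): substitute the (slow) averages per (2). */
static long ge_weight_avg(long tg_weight, long grq_avg, long tg_load_avg)
{
	return tg_weight * grq_avg / tg_load_avg;
}

int main(void)
{
	long tg_weight	   = 1024;
	long grq_weight[2] = { 512, 512 };	/* grq->load.weight  */
	long grq_avg[2]	   = { 500, 500 };	/* grq->avg.load_avg */
	long tg_load_avg   = grq_avg[0] + grq_avg[1];

	/* Both give ~tg_weight/2 while the averages track the weights. */
	printf("eq (1): %ld\n", ge_weight_exact(tg_weight, grq_weight, 2, 0));
	printf("eq (3): %ld\n", ge_weight_avg(tg_weight, grq_avg[0], tg_load_avg));
	return 0;
}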

> > + *
> > + * That is, the sum collapses because all other CPUs are idle; the UP scenario.
>
> Shouldn't (3) collapse in the same way too in this special case?

That's more difficult to see and (1) is the canonical form.

> In theory it should reduce to:
>
>                       tg->weight * grq->avg.load_avg
>    ge->load.weight = ------------------------------
>                             grq->avg.load_avg

Yes, agreed.

> But I can see many reasons why it won't happen in practice if things
> aren't perfectly up-to-date. If tg->load_avg and grq->avg.load_avg in
> (3) aren't in sync, or there are stale contributions to tg->load_avg
> from other cpus then (3) can return anything between 0 and tg->weight.

Just so.
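
To put numbers on that: one runnable task here, every other CPU idle,
but tg->load_avg still carrying stale contributions (toy values, same
made-up names as the sketch above, not kernel code):

#include <stdio.h>

int main(void)
{
	long tg_weight	 = 1024;
	long grq_avg	 = 1024;  /* our grq->avg.load_avg, fully built up */
	long stale	 = 3072;  /* leftover \Sum from the idle CPUs      */
	long tg_load_avg = grq_avg + stale;

	/* (3) should collapse to tg->weight, but: 1024*1024/4096 = 256 */
	printf("eq (3): %ld (tg->weight = %ld)\n",
	       tg_weight * grq_avg / tg_load_avg, tg_weight);
	return 0;
}

With stale = 0 it does give 1024; in between it lands anywhere in
(0, tg->weight], which is your point.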

> > + *
> > + * So what we do is modify our approximation (3) to approach (4) in the (near)
> > + * UP case, like:
> > + *
> > + * ge->load.weight =
> > + *
> > + *              tg->weight * grq->load.weight
> > + *   ---------------------------------------------------           (5)
> > + *   tg->load_avg - grq->avg.load_avg + grq->load.weight
> > + *
> > + *
> > + * And that is shares_weight and is icky. In the (near) UP case it approaches
> > + * (4) while in the normal case it approaches (3). It consistently
> > + * overestimates the ge->load.weight and therefore:
> > + *
> > + * \Sum ge->load.weight >= tg->weight
> > + *
> > + * hence icky!
>
> IIUC, if grq->avg.load_avg > grq->load.weight, i.e. you have blocked
> tasks, you can end up underestimating ge->load.weight for some of the
> group entities, leading to \Sum ge->load.weight < tg->weight.

Ah yes, you're right. However, if you look at the end of the series we
actually end up using:

max(grq->load.weight, grq->avg.load_avg)

Which I suppose makes it true again.
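
For completeness, a standalone sketch (simplified: no scale_load_down(),
no atomics or locking, and calc_shares() is just a name I picked) of
where the series lands, i.e. (5) with that max() substituted in:

#include <stdio.h>

#define max(a, b) ((a) > (b) ? (a) : (b))

static long calc_shares(long tg_weight, long tg_load_avg,
			long grq_weight, long grq_avg)
{
	long load = max(grq_weight, grq_avg);
	/* denominator of (5), with grq->load.weight -> load */
	long tg_load = tg_load_avg - grq_avg + load;

	if (!tg_load)
		return tg_weight;	/* fully idle group; avoid div by 0 */

	return tg_weight * load / tg_load;
}

int main(void)
{
	/* (near) UP: one busy CPU, no stale avg -> ~tg->weight */
	printf("%ld\n", calc_shares(1024, 1024, 1024, 1024));
	/* steady state, two equally busy CPUs -> ~tg->weight/2 */
	printf("%ld\n", calc_shares(1024, 2048, 1024, 1024));
	return 0;
}

Since load can never drop below grq->load.weight, blocked load no longer
shrinks the numerator relative to the denominator, which is what makes
the overestimate property hold again.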