Re: [PATCH v8 -tip 08/26] sched/fair: Snapshot the min_vruntime of CPUs on force idle

From: Peter Zijlstra
Date: Thu Oct 29 2020 - 15:00:54 EST


On Thu, Oct 29, 2020 at 02:24:29PM -0400, Joel Fernandes wrote:

> > @@ -4823,10 +4822,8 @@ pick_next_task(struct rq *rq, struct tas
> > if (!rq_i->core_pick)
> > continue;
> >
> > - if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
> > - !rq_i->core->core_forceidle) {
> > - rq_i->core->core_forceidle = true;
> > - }
> > + if (!(fi_before && rq->core->core_forceidle))
> > + task_vruntime_update(rq_i, rq_i->core_pick);
>
> Shouldn't this be:
>
> if (!fi_before && rq->core->core_forceidle)
> task_vruntime_update(rq_i, rq_i->core_pick);
>
> ?

*groan*, I should've written a comment there :/

When we're not fi, we need to update.
when we're fi and we were not fi, we must update
When we're fi and we were already fi, we must not update

Which gives:

fib fi X

0 0 1
0 1 0
1 0 1
1 1 1

which is: !(!fib && fi) or something.

> > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > +{
> > + struct rq *rq = task_rq(a);
> > + struct sched_entity *sea = &a->se;
> > + struct sched_entity *seb = &b->se;
> > + struct cfs_rq *cfs_rqa;
> > + struct cfs_rq *cfs_rqb;
> > + s64 delta;
> > +
> > + SCHED_WARN_ON(task_rq(b)->core != rq->core);
> > +
> > + while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
> > + int sea_depth = sea->depth;
> > + int seb_depth = seb->depth;
> > +
> > + if (sea_depth >= seb_depth)
> > + sea = parent_entity(sea);
> > + if (sea_depth <= seb_depth)
> > + seb = parent_entity(seb);
> > + }
> > +
> > + if (rq->core->core_forceidle) {
> > + se_fi_update(sea, rq->core->core_forceidle_seq, true);
> > + se_fi_update(seb, rq->core->core_forceidle_seq, true);
> > + }
>
> As we chatted on IRC you mentioned the reason for the sync here is:
>
> say we have 2 cgroups (a,b) under root, and we go force-idle in a, then we
> update a and root. Then we pick and end up in b, but b hasn't been updated
> yet.
>
> One thing I was wondering about that was, if the pick of 'b' happens much
> later than 'a', then the snapshot might be happening too late right?

No, since this is the first pick in b since fi, it cannot have advanced.
So by updating to fi_seq before picking, we guarantee it is unchanged
since we went fi.