Re: [PATCH] sched/pelt: Fix task util_est update filtering

From: Vincent Guittot
Date: Thu Feb 25 2021 - 10:28:21 EST


On Mon, 22 Feb 2021 at 10:24, Vincent Donnefort
<vincent.donnefort@xxxxxxx> wrote:
>
> On Fri, Feb 19, 2021 at 11:48:28AM +0100, Vincent Guittot wrote:
> > On Tue, 16 Feb 2021 at 17:39, <vincent.donnefort@xxxxxxx> wrote:
> > >
> > > From: Vincent Donnefort <vincent.donnefort@xxxxxxx>
> > >
> > > Being called at each dequeue, util_est reduces the number of its updates
> > > by filtering out the case where the EWMA signal differs from the task
> > > util_avg by less than 1%. This is a problem for a sudden util_avg ramp-up:
> > > because of the decay from a previous high util_avg, the EWMA might now be
> > > close enough to the new util_avg. No update would then happen, leaving
> > > ue.enqueued with an out-of-date value.
> > >
> > > Taking both util_est members, EWMA and enqueued, into account for the
> > > filtering ensures an up-to-date value for both.
> > >
> > > For now, this is only an issue for the trace probe, which might return the
> > > stale value. Functionally, it isn't (yet) a problem, as the value is
> >
> > What do you mean by "it isn't (yet) a problem"? How could this become
> > a problem?
>
> I wrote "yet" as nothing prevents anyone from using the ue.enqueued signal.

Hmm... you are not supposed to use it outside the helper functions, so
this is irrelevant IMO, which means that only the trace probe is
impacted.
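
For reference, the expected access path looks roughly like the sketch
below (simplified; the exact mainline helper may differ slightly).
Because readers take the max() of the two members, a stale ue.enqueued
stays hidden as long as the EWMA is the larger of the two:

    static inline unsigned long _task_util_est(struct task_struct *p)
    {
            struct util_est ue = READ_ONCE(p->se.avg.util_est);

            /* readers only ever see max(enqueued, ewma) */
            return max(ue.ewma, ue.enqueued);
    }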

>
> >
> > > always accessed through max(enqueued, ewma).
> > >
> >
> > This adds more tests and/or updates of struct avg.util_est. It would
> > be good to have an idea of the perf impact, especially because this
> > only fixes a tracing problem.
>
> I ran hackbench on the big cores of an SD845C board. After 100 iterations of
> 100-loop runs, the geometric mean of the hackbench test is 0.1% lower
> with this patch applied (2.0833s vs 2.0858s). The p-value, computed with
> ks_2samp [1], is 0.37, so we can't conclude that the two distributions are
> different. In this scenario, the patch seems completely harmless.

For this kind of change, perf bench sched pipe is better at highlighting
any perf regression. I have done a quick test and I haven't seen a
noticeable difference.

>
> Shall I include those results in the commit message?
>
> [1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html
>
> >
> >
> > > This problem has been observed using LISA's UtilConvergence:test_means on
> > > the sd845c board.
> > >
> > > Signed-off-by: Vincent Donnefort <vincent.donnefort@xxxxxxx>
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 794c2cb945f8..9008e0c42def 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3941,24 +3941,27 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> > >          trace_sched_util_est_cfs_tp(cfs_rq);
> > >  }
> > >
> > > +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> > > +
> > >  /*
> > > - * Check if a (signed) value is within a specified (unsigned) margin,
> > > + * Check if a (signed) value is within the (unsigned) util_est margin,
> > >   * based on the observation that:
> > >   *
> > >   *     abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
> > >   *
> > > - * NOTE: this only works when value + maring < INT_MAX.
> > > + * NOTE: this only works when value + UTIL_EST_MARGIN < INT_MAX.
> > >   */
> > > -static inline bool within_margin(int value, int margin)
> > > +static inline bool util_est_within_margin(int value)
> > >  {
> > > -        return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
> > > +        return ((unsigned int)(value + UTIL_EST_MARGIN - 1) <
> > > +                (2 * UTIL_EST_MARGIN - 1));
> > >  }
> > >
> > >  static inline void util_est_update(struct cfs_rq *cfs_rq,
> > >                                     struct task_struct *p,
> > >                                     bool task_sleep)
> > >  {
> > > -        long last_ewma_diff;
> > > +        long last_ewma_diff, last_enqueued_diff;
> > >          struct util_est ue;
> > >
> > >          if (!sched_feat(UTIL_EST))
> > > @@ -3979,6 +3982,8 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> > >          if (ue.enqueued & UTIL_AVG_UNCHANGED)
> > >                  return;
> > >
> > > +        last_enqueued_diff = ue.enqueued;
> > > +
> > >          /*
> > >           * Reset EWMA on utilization increases, the moving average is used only
> > >           * to smooth utilization decreases.
> > > @@ -3992,12 +3997,19 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> > >          }
> > >
> > >          /*
> > > -         * Skip update of task's estimated utilization when its EWMA is
> > > +         * Skip update of task's estimated utilization when its members are
> > >           * already ~1% close to its last activation value.
> > >           */
> > >          last_ewma_diff = ue.enqueued - ue.ewma;
> > > -        if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> > > +        last_enqueued_diff -= ue.enqueued;
> > > +        if (util_est_within_margin(last_ewma_diff)) {
> > > +                if (!util_est_within_margin(last_enqueued_diff)) {
> > > +                        ue.ewma = ue.enqueued;

Why do you set ewma directly to the latest enqueued value?

> > > +                        goto done;
> > > +                }
> > > +
> > >                  return;
> > > +        }
> > >
> > >          /*
> > >           * To avoid overestimation of actual task utilization, skip updates if
> > > --
> > > 2.25.1
> > >
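
As an aside, the arithmetic trick behind util_est_within_margin() can be
checked in isolation with a small userspace program. This is illustrative
only: UTIL_EST_MARGIN is hard-coded to 1024 / 100 here and nothing below
is kernel code.

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define UTIL_EST_MARGIN (1024 / 100)  /* SCHED_CAPACITY_SCALE / 100 */

    /* Same observation as the kernel comment:
     * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
     */
    static int within_margin(int value, int margin)
    {
            return (unsigned int)(value + margin - 1) <
                   (unsigned int)(2 * margin - 1);
    }

    int main(void)
    {
            int x;

            /*
             * util_avg and the EWMA are both bounded by SCHED_CAPACITY_SCALE
             * (1024), so their difference stays far below INT_MAX and the
             * unsigned comparison behaves exactly like abs(x) < margin.
             */
            for (x = -2048; x <= 2048; x++)
                    assert(within_margin(x, UTIL_EST_MARGIN) ==
                           (abs(x) < UTIL_EST_MARGIN));

            printf("within_margin() matches abs(x) < %d on [-2048, 2048]\n",
                   UTIL_EST_MARGIN);
            return 0;
    }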