Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig

From: Morten Rasmussen
Date: Wed Sep 09 2015 - 07:09:33 EST


On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote:
> > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
>
> > > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > > But if we apply the scaling to the weight instead of the time, we
> > > would only have to apply it once and not three times as it is now? So
> > > maybe we can end up with almost the same number of multiplications.
> > >
> > > We might be losing bits for low priority tasks running on cpus at a
> > > low frequency though.
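
(To put rough numbers on the bit-loss concern -- hypothetical figures,
assuming SCHED_CAPACITY_SHIFT = 10: a nice 19 task has load weight 15,
and take scale_freq = 256, i.e. ~25% of max frequency:

	scaled_weight = 15 * 256 >> 10 = 3	(exact value: 3.75)

so accruing a full 1024 us period:

	scaling the time:   15 * (1024 * 256 >> 10) = 3840
	scaling the weight:  3 * 1024               = 3072

i.e. pre-scaling the small weight truncates early and underestimates
the contribution by ~20%.)
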
> >
> > Something like the below. We should be saving one multiplication.
>
> > @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > return 0;
> > sa->last_update_time = now;
> >
> > - scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > - scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > + if (weight || running)
> > + scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > + if (weight)
> > + scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
> > + if (running)
> > + scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
> > + >> SCHED_CAPACITY_SHIFT;
> >
> > /* delta_w is the amount already accumulated against our next period */
> > delta_w = sa->period_contrib;
> > @@ -2594,16 +2597,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > * period and accrue it.
> > */
> > delta_w = 1024 - delta_w;
> > - scaled_delta_w = cap_scale(delta_w, scale_freq);
> > if (weight) {
> > - sa->load_sum += weight * scaled_delta_w;
> > + sa->load_sum += scaled_weight * delta_w;
> > if (cfs_rq) {
> > cfs_rq->runnable_load_sum +=
> > - weight * scaled_delta_w;
> > + scaled_weight * delta_w;
> > }
> > }
> > if (running)
> > - sa->util_sum += scaled_delta_w * scale_cpu;
> > + sa->util_sum += delta_w * scale_freq_cpu;
> >
> > delta -= delta_w;
> >
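
(Counting just the scaling multiplications, to make the "saving one"
claim concrete: the current code applies cap_scale() to the time delta
and then multiplies by weight and scale_cpu at each of the three
accrual sites in __update_load_avg(), roughly 3 * 3 = 9 muls. The
variant above computes scaled_weight and scale_freq_cpu once (2 muls)
and then needs only two muls per site: 2 + 2 * 3 = 8.)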
>
> Sadly that makes the code worse; I get 14 mul instructions where
> previously I had 11.
>
> What happens is that GCC gets confused and cannot constant propagate the
> new variables, so what used to be shifts now end up being actual
> multiplications.
>
> With this, I get back to 11. Can you see what happens on ARM where you
> have both functions defined as non-constants?
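
For reference, a minimal standalone sketch of the effect you describe
(the names are made up for illustration; whether the constant actually
gets propagated depends on compiler version and flags):

	#define SCHED_CAPACITY_SHIFT	10
	#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

	/* Stand-in for the default, compile-time constant arch hook. */
	static inline unsigned long scale_freq_const(void)
	{
		return SCHED_CAPACITY_SCALE;
	}

	/*
	 * Constant visible at the use site: the multiply by 1024 can
	 * be strength-reduced to a shift, so no mul is emitted.
	 */
	unsigned long scale_direct(unsigned long delta)
	{
		return delta * scale_freq_const() >> SCHED_CAPACITY_SHIFT;
	}

	/*
	 * Same value routed through a conditionally assigned local, as
	 * in the patch above: the compiler may fail to propagate the
	 * constant and a real mul survives.
	 */
	unsigned long scale_indirect(unsigned long delta, int weight,
				     int running)
	{
		unsigned long scale_freq = 0;

		if (weight || running)
			scale_freq = scale_freq_const();

		return delta * scale_freq >> SCHED_CAPACITY_SHIFT;
	}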

We repeated the experiment on arm and arm64, still with the functions
defined as constants so we can compare with your results. The mul
instruction count seems to be somewhat compiler-version dependent, but
it consistently shows no effect from the patch:

arm     before  after
gcc4.9  12      12
gcc4.8  10      10

arm64   before  after
gcc4.9  11      11

I will also get numbers with the arch functions actually implemented,
and do hackbench runs to see what happens in terms of performance.