Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

From: Valentin Schneider
Date: Mon Jun 08 2020 - 09:06:27 EST



On 08/06/20 13:31, Qais Yousef wrote:
> With uclamp enabled but no fair group I get
>
> *** uclamp enabled/fair group disabled ***
>
> # Executed 50000 pipe operations between two threads
> Total time: 0.856 [sec]
>
> 17.125740 usecs/op
> 58391 ops/sec
>
> The drop is 5.5% in ops/sec. Or 1 usecs/op.
>
> I don't know what's the expectation here. 1 us could be a lot, but I don't
> think we expect the new code to take more than few 100s of ns anyway. If you
> add potential caching effects, reaching 1 us wouldn't be that hard.
>

I don't think it's fair to look at the absolute delta. This being a very
hot path, cumulative overhead gets scary real quick. A drop of 5.5% work
done is a big hour lost over a full processing day.

> Note that in my runs I chose performance governor and use `taskset 0x2` to
> force running on a big core to make sure the runs are repeatable.
>
> On Juno-r2 I managed to scrap most of the 1 us with the below patch. It seems
> there was weird branching behavior that affects the I$ in my case. It'd be good
> to try it out to see if it makes a difference for you.
>
> The I$ effect is my best educated guess. Perf doesn't catch this path and
> I couldn't convince it to look at cache and branch misses between 2 specific
> points.
>
> Other subtle code shuffling did have weird effect on the result too. One worthy
> one is making uclamp_rq_dec() noinline gains back ~400 ns. Making
> uclamp_rq_inc() noinline *too* cancels this gain out :-/
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..0835ee20a3c7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1071,13 +1071,11 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>
> static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> {
> - enum uclamp_id clamp_id;
> -
> if (unlikely(!p->sched_class->uclamp_enabled))
> return;
>
> - for_each_clamp_id(clamp_id)
> - uclamp_rq_inc_id(rq, p, clamp_id);
> + uclamp_rq_inc_id(rq, p, UCLAMP_MIN);
> + uclamp_rq_inc_id(rq, p, UCLAMP_MAX);
>
> /* Reset clamp idle holding when there is one RUNNABLE task */
> if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
> @@ -1086,13 +1084,11 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
> static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> {
> - enum uclamp_id clamp_id;
> -
> if (unlikely(!p->sched_class->uclamp_enabled))
> return;
>
> - for_each_clamp_id(clamp_id)
> - uclamp_rq_dec_id(rq, p, clamp_id);
> + uclamp_rq_dec_id(rq, p, UCLAMP_MIN);
> + uclamp_rq_dec_id(rq, p, UCLAMP_MAX);
> }
>

That's... Surprising. Did you look at the difference in generated code?

> static inline void
>
>
> FWIW I fail to see activate/deactivate_task in perf record. They don't show up
> on the list which means this micro benchmark doesn't stress them as Mel's test
> does.
>

You're not going to see it them perf on the Juno. They're in IRQ
disabled sections, so AFAICT it won't get sampled as you don't have
NMIs. You can turn on ARM64_PSEUDO_NMI, but you'll need a GICv3 (Ampere
eMAG, Cavium ThunderX2).