Re: [RFC PATCH] sched: Consolidate cpufreq updates

From: Qais Yousef
Date: Thu Mar 28 2024 - 17:56:10 EST


On 03/26/24 09:20, Ingo Molnar wrote:
>
> * Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
>
> > Results of `perf stat --repeat 10 perf bench sched pipe` on AMD 3900X to
> > verify any potential overhead because of the addition at context switch
> >
> > Before:
> > -------
> >
> > Performance counter stats for 'perf bench sched pipe' (10 runs):
> >
> > 16,839.74 msec task-clock:u # 1.158 CPUs utilized ( +- 0.52% )
> > 0 context-switches:u # 0.000 /sec
> > 0 cpu-migrations:u # 0.000 /sec
> > 1,390 page-faults:u # 83.903 /sec ( +- 0.06% )
> > 333,773,107 cycles:u # 0.020 GHz ( +- 0.70% ) (83.72%)
> > 67,050,466 stalled-cycles-frontend:u # 19.94% frontend cycles idle ( +- 2.99% ) (83.23%)
> > 37,763,775 stalled-cycles-backend:u # 11.23% backend cycles idle ( +- 2.18% ) (83.09%)
> > 84,456,137 instructions:u # 0.25 insn per cycle
> > # 0.83 stalled cycles per insn ( +- 0.02% ) (83.01%)
> > 34,097,544 branches:u # 2.058 M/sec ( +- 0.02% ) (83.52%)
> > 8,038,902 branch-misses:u # 23.59% of all branches ( +- 0.03% ) (83.44%)
> >
> > 14.5464 +- 0.0758 seconds time elapsed ( +- 0.52% )
> >
> > After:
> > -------
> >
> > Performance counter stats for 'perf bench sched pipe' (10 runs):
> >
> > 16,219.58 msec task-clock:u # 1.130 CPUs utilized ( +- 0.80% )
> > 0 context-switches:u # 0.000 /sec
> > 0 cpu-migrations:u # 0.000 /sec
> > 1,391 page-faults:u # 85.163 /sec ( +- 0.06% )
> > 342,768,312 cycles:u # 0.021 GHz ( +- 0.63% ) (83.36%)
> > 66,231,208 stalled-cycles-frontend:u # 18.91% frontend cycles idle ( +- 2.34% ) (83.95%)
> > 39,055,410 stalled-cycles-backend:u # 11.15% backend cycles idle ( +- 1.80% ) (82.73%)
> > 84,475,662 instructions:u # 0.24 insn per cycle
> > # 0.82 stalled cycles per insn ( +- 0.02% ) (83.05%)
> > 34,067,160 branches:u # 2.086 M/sec ( +- 0.02% ) (83.67%)
> > 8,042,888 branch-misses:u # 23.60% of all branches ( +- 0.07% ) (83.25%)
> >
> > 14.358 +- 0.116 seconds time elapsed ( +- 0.81% )
>
> Noise caused by too many counters & the vagaries of multi-CPU scheduling is
> drowning out any results here.
>
> I'd suggest something like this to measure same-CPU context-switching
> overhead:
>
> taskset 1 perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe
>
> ... and make sure the cpufreq governor is at 'performance' first:

The performance governor won't stress the patch, as the static key should
bypass the new code.

>
> for ((cpu=0; cpu < $(nproc); cpu++)); do echo performance > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor; done

There's this shorthand, if you like:

echo performance | sudo tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor
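
To double check that the switch took effect on every policy before a run,
something along these lines should work. It is only a sketch: the
show_governors name and its optional root-directory argument are made up for
illustration, and it assumes the usual policy*/scaling_governor sysfs layout
used in the command above.

```shell
#!/bin/sh
# Print "<policy>: <governor>" for every cpufreq policy.
# The optional first argument (a hypothetical convenience, not a kernel
# interface) overrides the sysfs root so the function can be tried on a
# scratch directory.
show_governors() {
    root="${1:-/sys/devices/system/cpu/cpufreq}"
    for f in "$root"/policy*/scaling_governor; do
        # Skip the unexpanded glob when no policy directories exist.
        [ -r "$f" ] || continue
        printf '%s: %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
    done
}

show_governors "$@"
```

Running it once before and once after the benchmark also gives a record of
which governor to restore afterwards.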

>
> With that approach you should see much, much lower noise levels even with
> just 3 runs:
>
> Performance counter stats for 'perf bench sched pipe' (3 runs):
>
> 51,616,501,297 cycles # 3.188 GHz ( +- 0.05% )
> 37,523,641,203 instructions # 0.73 insn per cycle ( +- 0.08% )
> 16,191.01 msec task-clock # 0.999 CPUs utilized ( +- 0.04% )
>
> 16.20511 +- 0.00578 seconds time elapsed ( +- 0.04% )

Thanks for the tips!

I repeated the test using taskset and fewer counters, for both the
performance and schedutil governors:


tip: schedutil:
---------------

Performance counter stats for 'perf bench sched pipe' (10 runs):

829,076,881 cycles:u # 0.077 GHz ( +- 1.26% )
82,712,937 instructions:u # 0.10 insn per cycle ( +- 0.00% )
10,735.67 msec task-clock:u # 1.002 CPUs utilized ( +- 0.08% )

10.71758 +- 0.00840 seconds time elapsed ( +- 0.08% )

tip: performance:
-----------------

Performance counter stats for 'perf bench sched pipe' (10 runs):

871,744,951 cycles:u # 0.079 GHz ( +- 1.04% )
82,711,239 instructions:u # 0.10 insn per cycle ( +- 0.00% )
11,076.50 msec task-clock:u # 1.004 CPUs utilized ( +- 0.20% )

11.0374 +- 0.0216 seconds time elapsed ( +- 0.20% )

tip+patch: schedutil:
---------------------

Performance counter stats for 'perf bench sched pipe' (10 runs):

836,767,470 cycles:u # 0.078 GHz ( +- 0.69% )
82,712,893 instructions:u # 0.10 insn per cycle ( +- 0.00% )
10,825.83 msec task-clock:u # 1.005 CPUs utilized ( +- 0.12% )

10.7751 +- 0.0128 seconds time elapsed ( +- 0.12% )

tip+patch: performance:
-----------------------

Performance counter stats for 'perf bench sched pipe' (10 runs):

842,037,546 cycles:u # 0.077 GHz ( +- 0.97% )
82,717,942 instructions:u # 0.10 insn per cycle ( +- 0.00% )
10,921.37 msec task-clock:u # 0.996 CPUs utilized ( +- 0.18% )

10.9629 +- 0.0202 seconds time elapsed ( +- 0.18% )
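
For a quick comparison, the relative change in elapsed time between tip and
tip+patch can be worked out from the numbers above. This is just arithmetic
on the reported means; the pct_change helper is a made-up name for
illustration and it ignores the +- spread that perf reports.

```shell
#!/bin/sh
# pct_change BASE NEW: relative change of NEW versus BASE, in percent.
pct_change() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%+.2f%%\n", (new - base) / base * 100 }'
}

# Elapsed seconds, tip vs tip+patch, taken from the runs above.
pct_change 10.71758 10.7751   # schedutil:   prints +0.54%
pct_change 11.0374  10.9629   # performance: prints -0.67%
```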


Thanks!

--
Qais Yousef