Re: [PATCH v3 00/21] cpufreq: introduce a new AMD CPU frequency control mechanism

From: Giovanni Gherdovich
Date: Thu Nov 04 2021 - 12:40:27 EST


On Fri, 2021-10-29 at 21:02 +0800, Huang Rui wrote:
> Hi all,
>
> We would like to introduce a new AMD CPU frequency control mechanism as the
> "amd-pstate" driver for modern AMD Zen based CPU series in Linux Kernel.
>
> ..snip..

Hello,

I've tested this driver and the results seem a little underwhelming.
The test machine is a two-socket server with two AMD EPYC 7713
(family:model:stepping 25:1:1), 128 cores / 256 threads, 256 GB of memory
and SSD storage. On this system the amd-pstate driver works only in
"shared memory support" mode, not in "full MSR support" mode, meaning that
frequency switches are triggered from a workqueue instead of from
scheduler context (!fast_switch).
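
For reference, my understanding (please correct me if I got the feature
bit wrong) is that the MSR path is chosen when CPUID advertises CPPC
support, i.e. leaf 0x80000008, EBX bit 27 (X86_FEATURE_CPPC), and the
shared-memory path is the fallback. A minimal userspace check, under that
assumption:

/* cppc-msr-check.c: does this CPU advertise the CPPC MSR interface?
 * Assumption: amd-pstate picks "full MSR support" when CPUID leaf
 * 0x80000008 reports EBX bit 27 (X86_FEATURE_CPPC), and falls back to
 * the ACPI/PCC "shared memory support" otherwise.
 * Build: gcc -O2 -o cppc-msr-check cppc-msr-check.c
 */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
                fprintf(stderr, "CPUID leaf 0x80000008 not available\n");
                return 1;
        }

        if (ebx & (1u << 27))
                printf("CPPC MSRs advertised: expect \"full MSR support\"\n");
        else
                printf("no CPPC MSR bit: expect \"shared memory support\"\n");

        return 0;
}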

Dbench sees some ludicrous improvements in both performance and
performance per watt, and netperf sees some modest improvements, but
that's about the only good news. With both schedutil and ondemand, tbench
and hackbench do worse with amd-pstate than with acpi-cpufreq. I don't
have data for ondemand/amd-pstate on kernbench and gitsource, but
schedutil regresses on both.

Here are the tables, followed by some questions & discussion points.

Tilde (~) means the result is the same as the baseline (that is, the ratio is close to 1).
"Sugov" means "schedutil governor", "perfgov" means "performance governor".

           :       acpi-cpufreq       :        amd-pstate        :
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
           : ondemand  sugov  perfgov : ondemand  sugov  perfgov : better if
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
PERFORMANCE RATIOS
dbench     :   1.00      ~     0.33   :   0.37    0.35    0.36   : lower
netperf    :   1.00    0.97      ~    :   1.03    1.04      ~    : higher
tbench     :   1.00    1.04    1.06   :   0.83    0.40    1.05   : higher
hackbench  :   1.00      ~     1.03   :   1.09    1.42    1.03   : lower
kernbench  :   1.00    0.96    0.97   :    N/A    1.08      ~    : lower
gitsource  :   1.00    0.67    0.69   :    N/A    0.79    0.67   : lower
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
PERFORMANCE-PER-WATT RATIOS
dbench4    :   1.00      ~     3.37   :   2.68    3.12    3.03   : higher
netperf    :   1.00    0.96      ~    :   1.09    1.06      ~    : higher
tbench4    :   1.00    1.03    1.06   :   0.76    0.34    1.04   : higher
hackbench  :   1.00      ~     0.95   :   0.88    0.65    0.96   : higher
kernbench  :   1.00    1.06    1.05   :    N/A    0.93    1.05   : higher
gitsource  :   1.00    1.53    1.50   :    N/A    1.33    1.55   : higher


How to read the table: all numbers are ratios of the result of a given
governor/driver combination to the result of ondemand/acpi-cpufreq, which
is the baseline (first column). When the "better if" column says "higher",
a ratio larger than 1 is an improvement; when it says "lower", a ratio
larger than 1 is a regression. Example: hackbench with sugov/amd-pstate is
42% slower than with ondemand/acpi-cpufreq (top table). At the same time,
it's also 35% less efficient (bottom table).

Now, some questions / possible troubleshooting directions:

- ACPI-CPUFREQ DRIVER: ARE REQUESTS HINTS OR MANDATES?
When using acpi-cpufreq, and the OS requests some frequency (one of the
three allowed P-States), does the hardware underneath stick to it? Or does
it do some further adjustment of its own based on the load? This would
tell us whether a machine using acpi-cpufreq is less dumb than it seems,
and can in principle do fine-grained adjustments all the same. A small
measurement sketch follows below.
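
One way I intend to double-check this (a rough sketch, with some
assumptions on my side: the msr module is loaded, CPU 0 is representative,
and the nominal/P0 frequency in kHz is passed as the first argument): keep
a core busy, derive the effective frequency from the APERF/MPERF deltas,
and compare it with whatever cpufreq reports as the current frequency:

/* freq-check.c: compare what cpufreq reports (scaling_cur_freq) with the
 * effective frequency derived from the APERF/MPERF MSRs over a busy window.
 * Assumptions: run as root, msr module loaded, CPU 0 is measured, and the
 * nominal (P0) frequency in kHz is passed as argv[1].
 * Build: gcc -O2 -o freq-check freq-check.c
 * Usage: ./freq-check 2000000
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>

#define MSR_MPERF 0xe7
#define MSR_APERF 0xe8

static uint64_t rdmsr(int fd, off_t reg)
{
        uint64_t val;

        if (pread(fd, &val, sizeof(val), reg) != sizeof(val)) {
                perror("pread");
                exit(1);
        }
        return val;
}

int main(int argc, char **argv)
{
        double nominal_khz = argc > 1 ? atof(argv[1]) : 2000000.0;
        const char *cur = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
        volatile uint64_t spin = 0;
        uint64_t a0, m0, a1, m1;
        long reported_khz = -1;
        cpu_set_t set;
        FILE *f;
        int fd;

        /* pin to CPU 0 so the MSRs we read belong to the CPU we load */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        sched_setaffinity(0, sizeof(set), &set);

        fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) {
                perror("open /dev/cpu/0/msr (msr module loaded? root?)");
                return 1;
        }

        a0 = rdmsr(fd, MSR_APERF);
        m0 = rdmsr(fd, MSR_MPERF);

        /* keep the core 100% busy so the governor has a reason to ramp up */
        for (uint64_t i = 0; i < 500000000ULL; i++)
                spin += i;

        a1 = rdmsr(fd, MSR_APERF);
        m1 = rdmsr(fd, MSR_MPERF);

        f = fopen(cur, "r");
        if (f) {
                if (fscanf(f, "%ld", &reported_khz) != 1)
                        reported_khz = -1;
                fclose(f);
        }

        printf("cpufreq reports : %ld kHz\n", reported_khz);
        printf("effective freq  : %.0f kHz (APERF/MPERF = %.3f)\n",
               nominal_khz * (a1 - a0) / (m1 - m0),
               (double)(a1 - a0) / (m1 - m0));
        return 0;
}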

- PROCESSING CPPC DOORBELL REQUESTS: HOW FAST IS THAT?
How long does it take the hardware to process a CPPC doorbell request to
change frequency? What happens to outstanding requests if they're not
processed in a timely manner? Is there a queue of requests, and if so, how
long is it? Could it be that, if requests come in too quickly, the CPU
ends up playing catch-up on frequency switches that are already obsolete
or redundant? A rough probe sketch follows below.
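
I don't know, but one crude way to get a number from userspace might be to
time reads of the CPPC feedback counters: if, on these shared-memory
systems, the delivered/reference counters live in the PCC region (an
assumption on my part), then every read of the acpi_cppc sysfs file should
involve one mailbox/doorbell round-trip, so its wall-clock cost is a rough
proxy for the doorbell latency. A sketch:

/* pcc-latency.c: rough estimate of the CPPC/PCC mailbox round-trip time,
 * measured as the wall-clock cost of reading the feedback counters.
 * Assumption: on this "shared memory" system the delivered/reference
 * counters are in the PCC region, so each read rings the doorbell.
 * Build: gcc -O2 -o pcc-latency pcc-latency.c
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define FB_CTRS "/sys/devices/system/cpu/cpu0/acpi_cppc/feedback_ctrs"
#define ITERATIONS 100

int main(void)
{
        struct timespec t0, t1;
        double total_us = 0.0, min_us = 1e9;
        char buf[128];

        for (int i = 0; i < ITERATIONS; i++) {
                /* reopen each time so sysfs gives us a fresh read */
                int fd = open(FB_CTRS, O_RDONLY);
                double us;

                if (fd < 0) {
                        perror("open " FB_CTRS);
                        return 1;
                }

                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (read(fd, buf, sizeof(buf) - 1) < 0) {
                        perror("read");
                        return 1;
                }
                clock_gettime(CLOCK_MONOTONIC, &t1);
                close(fd);

                us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                     (t1.tv_nsec - t0.tv_nsec) / 1e3;
                total_us += us;
                if (us < min_us)
                        min_us = us;
        }

        printf("feedback_ctrs read latency over %d reads: min %.1f us, avg %.1f us\n",
               ITERATIONS, min_us, total_us / ITERATIONS);
        return 0;
}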

- LIKE-FOR-LIKE: TRY BENCHMARKING WITH AMD-PSTATE LIMITED TO 3 P-STATES?
Could it be that, to study the performance of the "shared memory support"
systems against acpi-cpufreq, a more like-for-like comparison would be to
limit amd-pstate to only the 3 P-States available to acpi-cpufreq? That
would be for experimental/benchmarking purposes only. E.g.: on my machine
acpi-cpufreq sees 1.5 GHz, 1.7 GHz and 2 GHz. Given that max boost is
3.72 GHz, and the CPPC range is the abstract interval 0..255, I could
limit amd-pstate to only set performance levels of roughly 102, 116 and
137 (the arithmetic is sketched below), and see how it compares with the
old driver. What do you think?
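
For reference, the arithmetic I used for those levels; it assumes a simple
linear mapping of the abstract 0..255 range onto 0..max-boost, which may
well not be the scale the firmware actually uses, so treat the values as
approximations:

/* perf-levels.c: map the acpi-cpufreq P-state frequencies to approximate
 * CPPC performance levels, assuming a linear 0..255 <-> 0..max-boost scale
 * (an assumption on my side; the firmware's actual scale may differ).
 * Build: gcc -O2 -o perf-levels perf-levels.c
 */
#include <stdio.h>

int main(void)
{
        const double max_boost_mhz = 3720.0;   /* EPYC 7713 max boost */
        const double highest_perf  = 255.0;    /* top of the CPPC scale */
        const double pstate_mhz[]  = { 1500.0, 1700.0, 2000.0 };

        for (int i = 0; i < 3; i++)
                printf("%4.0f MHz -> perf level %d\n", pstate_mhz[i],
                       (int)(highest_perf * pstate_mhz[i] / max_boost_mhz));
        return 0;
}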

- PROCESSING CPPC DOORBELL REQS IS SLOW. BUT /MAKING/ A REQUEST, SLOW TOO?
It looks to me like, with the "shared memory support", the frequency
update process is doubly asynchronous: first the ->target() callback is
deferred to a workqueue, then, when it's eventually executed, it calls
cppc_update_perf(), which again just asks the firmware to do the work at a
later time. Are we sure that cppc_update_perf() is actually so slow as to
warrant !fast_switch? An end-to-end measurement sketch follows below.
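
To put a (very indirect) number on it, one could measure the whole path
from userspace: clamp scaling_max_freq while keeping a core busy, and poll
APERF/MPERF until the effective frequency actually drops. This times
governor + workqueue + firmware together, not cppc_update_perf() alone,
and it assumes the msr module is loaded and that writing scaling_max_freq
eventually reaches the driver; still, it would give an upper bound.
Sketch (remember to restore scaling_max_freq afterwards):

/* switch-latency.c: time from writing a lower scaling_max_freq to the
 * moment the effective frequency (APERF/MPERF) drops on CPU 0.
 * End-to-end number: governor + workqueue + firmware.
 * Assumptions: root, msr module loaded, nominal (P0) kHz as argv[1],
 * clamp kHz as argv[2]. The last loop spins until the clamp is visible.
 * Build: gcc -O2 -o switch-latency switch-latency.c
 * Usage: ./switch-latency 2000000 1500000
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <time.h>

#define MSR_MPERF 0xe7
#define MSR_APERF 0xe8
#define MAXFREQ "/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq"

static int msr_fd;

static uint64_t rdmsr(off_t reg)
{
        uint64_t val;

        if (pread(msr_fd, &val, sizeof(val), reg) != sizeof(val)) {
                perror("pread");
                exit(1);
        }
        return val;
}

/* effective kHz over a short busy window; the spin keeps the core in C0 */
static double eff_khz(double nominal_khz)
{
        uint64_t a0 = rdmsr(MSR_APERF), m0 = rdmsr(MSR_MPERF);
        volatile uint64_t spin = 0;

        for (uint64_t i = 0; i < 200000ULL; i++)
                spin += i;
        return nominal_khz * (rdmsr(MSR_APERF) - a0) / (rdmsr(MSR_MPERF) - m0);
}

int main(int argc, char **argv)
{
        double nominal_khz = argc > 1 ? atof(argv[1]) : 2000000.0;
        double clamp_khz = argc > 2 ? atof(argv[2]) : 1500000.0;
        double before, now, threshold;
        volatile uint64_t warmup = 0;
        struct timespec t0, t1;
        cpu_set_t set;
        char buf[32];
        int fd;

        CPU_ZERO(&set);
        CPU_SET(0, &set);
        sched_setaffinity(0, sizeof(set), &set);   /* stay on CPU 0 */

        msr_fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (msr_fd < 0) {
                perror("open /dev/cpu/0/msr");
                return 1;
        }

        /* warm up so the governor ramps the core to its steady-state freq */
        for (uint64_t i = 0; i < 500000000ULL; i++)
                warmup += i;
        before = eff_khz(nominal_khz);
        threshold = (before + clamp_khz) / 2.0;    /* halfway to the clamp */
        printf("effective frequency before the clamp: %.0f kHz\n", before);

        fd = open(MAXFREQ, O_WRONLY);
        if (fd < 0) {
                perror("open " MAXFREQ);
                return 1;
        }
        snprintf(buf, sizeof(buf), "%.0f", clamp_khz);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (write(fd, buf, strlen(buf)) < 0) {
                perror("write " MAXFREQ);
                return 1;
        }
        do {
                now = eff_khz(nominal_khz);
                clock_gettime(CLOCK_MONOTONIC, &t1);
        } while (now > threshold);
        close(fd);

        printf("dropped to %.0f kHz after %.0f us\n", now,
               (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3);
        return 0;
}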

- HOW MANY P-STATES ARE TOO MANY?
I've always believed the contrary, but what if having too many P-States is
harmful to both performance and efficiency? Maybe the governor is
requesting many updates in small increments where fewer (and larger)
updates would be more appropriate?


Thanks,
Giovanni