Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

From: Sumit Gupta
Date: Fri Apr 28 2023 - 07:03:08 EST




On 28/04/23 02:10, Yang Shi wrote:


On 4/26/23 12:01 PM, Ionela Voinescu wrote:
Hi,

+ Sumit

On Tuesday 25 Apr 2023 at 18:32:55 (-0700), Yang Shi wrote:

On 4/24/23 4:44 AM, Ionela Voinescu wrote:
Hey,

On Thursday 20 Apr 2023 at 13:49:24 (-0700), Yang Shi wrote:
On 4/20/23 9:01 AM, Pierre Gondois wrote:
You say that the cause of this is congestion on the interconnect. I don't
see a way to check that right now. However, your trace is on CPU0, so
maybe all the other cores were shut down in your test. If that is the
case, do you think congestion could happen with only one CPU?
No, other CPUs were not shut down in my test. I just ran "yes" on all
cores except CPU 0, then ran the frequency-reading script. Since all the
other cores were busy, the script should always have been running on CPU 0.

Since the counters, memory, and other devices are all on the interconnect,
the congestion may be caused by plenty of factors, IIUC.
+Ionela

Ionela pointed me to the following patch-set, which seems related:
https://lore.kernel.org/all/20230418113459.12860-5-sumitg@xxxxxxxxxx/
Thanks for the information. I think we do see a similar symptom, but I'm
not sure how their counters are implemented, so we may not have the same
root cause.
Yes, my bad, I did not get the chance to read this full thread before
talking with Pierre. In your case you have AMUs accessed via MMIO, while in
their case they are accessed through FFH (IPIs and system registers). The
root cause is indeed different.
Yeah, but it seems like using a larger delay could mitigate both issues.
Yes, there is a minimum required delay of 25us there, due to the way the
counters accumulate, which is not too bad even with interrupts disabled
(to cater to cpufreq_quick_get()).

[..]
Yeah, we should be able to find a smaller irq-disabled section.

diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
index c51d3ccb4cca..105a7e2ffffa 100644
--- a/drivers/acpi/cppc_acpi.c
+++ b/drivers/acpi/cppc_acpi.c
@@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
 	struct cppc_pcc_data *pcc_ss_data = NULL;
 	u64 delivered, reference, ref_perf, ctr_wrap_time;
 	int ret = 0, regs_in_pcc = 0;
+	unsigned long flags;
 
 	if (!cpc_desc) {
 		pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
@@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
 		}
 	}
 
+	local_irq_save(flags);
+
 	cpc_read(cpunum, delivered_reg, &delivered);
 	cpc_read(cpunum, reference_reg, &reference);
 	cpc_read(cpunum, ref_perf_reg, &ref_perf);
 
+	local_irq_restore(flags);
+
cpc_read_ffh() would return -EPERM if irqs are disabled.
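
For reference, the arm64 FFH backend refuses counter reads with irqs off
because it needs to IPI the target CPU. Roughly (paraphrased from
arch/arm64/kernel/topology.c; details may differ by kernel version):

/* Paraphrased from arch/arm64/kernel/topology.c; not verbatim. */
static int counters_read_on_cpu(int cpu, smp_call_func_t func, u64 *val)
{
	/* No AMU counters on this CPU, nothing to read. */
	if (!cpu_has_amu_feat(cpu))
		return -EOPNOTSUPP;

	/*
	 * smp_call_function_single() needs irqs enabled on the calling
	 * CPU to send and wait for the IPI, hence the -EPERM here.
	 */
	if (WARN_ON_ONCE(irqs_disabled()))
		return -EPERM;

	smp_call_function_single(cpu, func, val, 1);

	return 0;
}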

So, the irq disabling must happen only for MMIO, in cpc_read(), for
example:
I thought the issue was that irqs could happen in between the cpc_read()
calls, which the patch below would not cover. If the frequency is more
accurate with this patch, then there is something I don't understand.
Yeah, you are correct. The irq-disabled window has to cover all the
cpc_read() calls. I didn't test with this patch; my test was conceptually
done with:

disable irq
cppc_get_perf_ctrs(t0)
udelay(2)
cppc_get_perf_ctrs(t1)
enable irq

But this will break cpc_read_ffh().
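
In C, the conceptual test was roughly the following (just a sketch of the
experiment, not a proposed fix; error handling is trimmed, and the
irqs-off window is exactly what trips cpc_read_ffh()):

#include <linux/delay.h>
#include <linux/irqflags.h>
#include <acpi/cppc_acpi.h>

/* Sketch of the conceptual test above; not the proposed fix. */
static int cppc_sample_ctrs_window(int cpu)
{
	struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
	unsigned long flags;
	int ret;

	local_irq_save(flags);

	ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t0);
	if (!ret) {
		udelay(2);
		ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
	}

	local_irq_restore(flags);

	/* The t1 - t0 counter deltas would feed the freq computation. */
	return ret;
}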
Can you not disable IRQs in cppc_get_perf_ctrs() only if the registers
are CPC_IN_SYSTEM_MEMORY, spanning just the reads of the delivered
register and the reference register, which should have minimal delay in
between?

As in:

if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
      CPC_IN_SYSTEM_MEMORY(reference_reg))
    ...

This will not, and should not, affect FFH; the fix for that would have to
be different.
It won't work, right? The problem is that cppc_get_perf_ctrs() calls
cpc_read() to read delivered and reference respectively; we can only tell
whether they are CPC_IN_SYSTEM_MEMORY inside cpc_read(), not in
cppc_get_perf_ctrs(). So the resulting code would conceptually look like:

disable irq
read delivered
enable irq

disable irq
read reference
enable irq

But there may still be interrupts between the two reads. What we actually want is:

disable irq
read delivered
read reference
enable irq
Yes, this is what I was suggesting above.

I've hacked up the following code. It covers the FFH case as well, with a
modified version of the solution that Sumit proposed on the other thread:

diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
index 0f17b1c32718..7e828aed3693 100644
--- a/drivers/acpi/cppc_acpi.c
+++ b/drivers/acpi/cppc_acpi.c
@@ -110,6 +110,11 @@ static DEFINE_PER_CPU(struct cpc_desc *, cpc_desc_ptr);
                              (cpc)->cpc_entry.reg.space_id ==        \
                              ACPI_ADR_SPACE_SYSTEM_IO)

+/* Check if a CPC register is in FFH */
+#define CPC_IN_FFH(cpc) ((cpc)->type == ACPI_TYPE_BUFFER &&  \
+                             (cpc)->cpc_entry.reg.space_id ==        \
+                             ACPI_ADR_SPACE_FIXED_HARDWARE)
+
  /* Evaluates to True if reg is a NULL register descriptor */
  #define IS_NULL_REG(reg) ((reg)->space_id == ACPI_ADR_SPACE_SYSTEM_MEMORY && \
                              (reg)->address == 0 &&                  \
@@ -1292,6 +1297,24 @@ EXPORT_SYMBOL_GPL(cppc_perf_ctrs_in_pcc);
   *
   * Return: 0 for success with perf_fb_ctrs populated else -ERRNO.
   */
+
+struct cycle_counters {
+     int cpunum;
+     struct cpc_register_resource *delivered_reg;
+     struct cpc_register_resource *reference_reg;
+     u64 *delivered;
+     u64 *reference;
+};
+
+static int cppc_get_cycle_ctrs(void *cycle_ctrs) {
+     struct cycle_counters *ctrs = cycle_ctrs;
+
+     cpc_read(ctrs->cpunum, ctrs->delivered_reg, ctrs->delivered);
+     cpc_read(ctrs->cpunum, ctrs->reference_reg, ctrs->reference);
+
+     return 0;
+}
+
  int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
  {
      struct cpc_desc *cpc_desc = per_cpu(cpc_desc_ptr, cpunum);
@@ -1300,7 +1323,9 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
      int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpunum);
      struct cppc_pcc_data *pcc_ss_data = NULL;
      u64 delivered, reference, ref_perf, ctr_wrap_time;
+     struct cycle_counters ctrs = {0};
      int ret = 0, regs_in_pcc = 0;
+     unsigned long flags;

      if (!cpc_desc) {
              pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
@@ -1336,8 +1361,25 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
              }
      }

-     cpc_read(cpunum, delivered_reg, &delivered);
-     cpc_read(cpunum, reference_reg, &reference);
+     ctrs.cpunum = cpunum;
+     ctrs.delivered_reg = delivered_reg;
+     ctrs.reference_reg = reference_reg;
+     ctrs.delivered = &delivered;
+     ctrs.reference = &reference;
+
+     if (CPC_IN_FFH(delivered_reg) && CPC_IN_FFH(reference_reg)) {
+             ret = smp_call_on_cpu(cpunum, cppc_get_cycle_ctrs, &ctrs, false);
+     } else {
+             if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
+                 CPC_IN_SYSTEM_MEMORY(reference_reg)) {
+                     local_irq_save(flags);
+                     cppc_get_cycle_ctrs(&ctrs);
+                     local_irq_restore(flags);
+             } else {
+                     cppc_get_cycle_ctrs(&ctrs);
+             }
+     }
+
      cpc_read(cpunum, ref_perf_reg, &ref_perf);

      /*

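One note on the FFH branch above: smp_call_on_cpu() runs
cppc_get_cycle_ctrs() from a kworker pinned to the target CPU, so both
counter reads happen back to back on that CPU instead of one IPI per
cpc_read_ffh(). It waits for the work to complete and may sleep, so this
path must not be called from atomic context.
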
I've only tested this on a model using FFH, with a 10us delay, and it
worked well for me. Yang, Sumit, could you give it a try?

Thanks for the patch. I tested it with a 10us delay, and it works well for
me. There were zero high errors in my 3-hour test.

With this change, the delta between the set and observed frequencies has
reduced from ~25% to <10%, and occasionally ~15%. I tried with delays of
10, 25, 50, and 75 us.

With the change in [1] and the 25us delay in [2], the delta was <5%.
[1] https://lore.kernel.org/all/20230418113459.12860-7-sumitg@xxxxxxxxxx/
[2] https://lore.kernel.org/all/20230418113459.12860-5-sumitg@xxxxxxxxxx/

Thank you,
Sumit Gupta


Even with a solution like the above (more refined, of course), there is an
additional improvement possible: we can implement arch_freq_get_on_cpu()
for arm64 systems, have it use cached (on the tick) AMU cycle counter
samples, and have it called from show_cpuinfo_cur_freq() before/instead of
the .get() function. But this will only help arm64 systems with AMUs
accessible through system registers. We'll try to submit patches on this
soon. As I mentioned to Sumit on the other thread, though, the returned
frequency value from this will be an average over 4ms (with CONFIG_HZ=250)
and could be up to 4ms old (more than that only if the CPU was
idle/isolated).
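
To give a rough idea of the shape (a sketch only; the amu_sample struct,
the per-cpu variables, and the scaling below are made-up illustrations,
not the actual patches):

#include <linux/math64.h>
#include <linux/percpu.h>
#include <clocksource/arm_arch_timer.h>

/* Hypothetical cached AMU samples, updated on the scheduler tick. */
struct amu_sample {
	u64 core_cnt;	/* AMU core cycle counter at last tick */
	u64 const_cnt;	/* AMU constant-rate cycle counter at last tick */
};

static DEFINE_PER_CPU(struct amu_sample, amu_prev);
static DEFINE_PER_CPU(struct amu_sample, amu_last);

unsigned int arch_freq_get_on_cpu(int cpu)
{
	struct amu_sample *prev = per_cpu_ptr(&amu_prev, cpu);
	struct amu_sample *last = per_cpu_ptr(&amu_last, cpu);
	u64 core_delta = last->core_cnt - prev->core_cnt;
	u64 const_delta = last->const_cnt - prev->const_cnt;

	if (!const_delta)
		return 0;

	/*
	 * freq (kHz) = const_rate (Hz) / 1000 * delivered / reference,
	 * averaged over the last tick period (4ms with CONFIG_HZ=250).
	 */
	return (unsigned int)div64_u64(core_delta *
				       (arch_timer_get_rate() / 1000),
				       const_delta);
}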

Thanks,
Ionela.