Re: [PATCH] perf/intel: Remove Perfmon-v4 counter_freezing support

From: Stephane Eranian
Date: Tue Nov 10 2020 - 15:52:20 EST


On Tue, Nov 10, 2020 at 7:37 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
> > > The Precise Event Based Sampling (PEBS) feature supported on Intel Ice
> > > Lake server platforms can provide the architectural state of the
> > > instruction executed right after the instruction that caused the event.
> > > This patch set enables the PEBS via DS feature for KVM (also non) Linux
> > > guests on Ice Lake. The Linux guest can use the PEBS feature as it
> > > would natively:
> > >
> > > # perf record -e instructions:ppp ./br_instr a
> > > # perf record -c 100000 -e instructions:pp ./br_instr a
> > >
> > > If counter_freezing is not enabled on the host, guest PEBS will be
> > > deliberately disabled while the host is using the PEBS facility. By
> > > default, KVM disables the co-existence of guest PEBS and host PEBS.
> >
> > Uuhh, what?!? counter_freezing should never be enabled, it's broken. Let
> > me go delete all that code.
>
> ---
> Subject: perf/intel: Remove Perfmon-v4 counter_freezing support
>
> Perfmon-v4 counter freezing is fundamentally broken; remove this
> default-disabled code to make sure nobody uses it.
>
> The feature is called Freeze-on-PMI in the SDM, and if it actually did
> that, there wouldn't be a problem; *however*, it does something subtly
> different: it globally disables the whole PMU when it raises the PMI,
> not when the PMI hits.
>
> This means there's a window between the PMI getting raised and the PMI
> actually getting served where we lose events, and this violates perf
> counter independence. That is, a counting event should not result in a
> different event count when there is a sampling event co-scheduled.
>

What is implemented is Freeze-on-Overflow, yet it is described as Freeze-on-PMI.
That, in itself, is a problem. I agree with you on that point.
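
To put the two behaviors side by side (this is just my reading of what
you describe above):

  counter overflows              PMI is delivered
        |                               |
        |<-------- skid window -------->|
        |                               |
  Freeze-on-Overflow            Freeze-on-PMI
  stops counting here           would stop counting here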

However, there are use cases for both modes.

I can sample on event A and count on B and C, and when A overflows, I
want to snapshot B and C. For that I want the values of B and C at the
moment of the overflow, not at the moment the PMI is delivered; thus,
you would want the Freeze-on-Overflow behavior. You can collect in this
mode with the perf tool, IIRC:

perf record -e '{cycles,instructions,branches:S}' ....
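
For reference, a minimal version of that workflow (the event list and
workload name are just placeholders):

  perf record -e '{cycles,instructions}:S' -- ./workload
  perf report --group

IIRC the :S modifier makes the kernel read out the whole group at each
sample (PERF_SAMPLE_READ), which is exactly the "snapshot B, C at the
overflow of A" semantics.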

The other usage model is that of the replay debugger (rr) which you are
alluding to, which needs a precise count of an event, including during
the skid window. For that, you need Freeze-on-PMI (delivered). Note that
this tool likely only cares about user-level occurrences of events.
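
For the curious, a rough sketch of how such a tool would set up its
counter via perf_event_open() (hypothetical helper; rr's actual
configuration differs in the details, e.g. it counts a raw event):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_user_only_counter(pid_t pid)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
	attr.exclude_kernel = 1;	/* user-level occurrences only */
	attr.exclude_hv = 1;

	/* cpu = -1 (any), group_fd = -1, flags = 0 */
	return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}

The replay engine then depends on that count being exact across the skid
window, which is where the Freeze-on-PMI (delivered) semantics matter.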

As for counter independence, I am not sure it holds in all cases. If
the events are set up for user+kernel, then as soon as you co-schedule a
sampling event, you will likely get more counts on the counting event
due to the additional kernel entries/exits caused by interrupt-based
profiling. Even if you were to restrict to user level only, I would
expect to see a few more counts.
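
A rough way to see that effect (assuming perf record's follow-fork picks
up the child, and modulo run-to-run noise):

  # counting alone
  perf stat -e instructions:u -- ./workload

  # same counting event, with a sampling event co-scheduled on the task
  perf record -e cycles -o /dev/null -- \
      perf stat -e instructions:u -- ./workload

If independence held exactly, the two instructions:u totals would match;
I would expect the second run to come out slightly higher.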


> This is known to break existing software.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> ---
> arch/x86/events/intel/core.c | 152 -------------------------------------------
> arch/x86/events/perf_event.h | 3 +-
> 2 files changed, 1 insertion(+), 154 deletions(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index c79748f6921d..9909dfa6fb12 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2121,18 +2121,6 @@ static void intel_tfa_pmu_enable_all(int added)
> intel_pmu_enable_all(added);
> }
>
> -static void enable_counter_freeze(void)
> -{
> - update_debugctlmsr(get_debugctlmsr() |
> - DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
> -}
> -
> -static void disable_counter_freeze(void)
> -{
> - update_debugctlmsr(get_debugctlmsr() &
> - ~DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
> -}
> -
> static inline u64 intel_pmu_get_status(void)
> {
> u64 status;
> @@ -2696,95 +2684,6 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
> return handled;
> }
>
> -static bool disable_counter_freezing = true;
> -static int __init intel_perf_counter_freezing_setup(char *s)
> -{
> - bool res;
> -
> - if (kstrtobool(s, &res))
> - return -EINVAL;
> -
> - disable_counter_freezing = !res;
> - return 1;
> -}
> -__setup("perf_v4_pmi=", intel_perf_counter_freezing_setup);
> -
> -/*
> - * Simplified handler for Arch Perfmon v4:
> - * - We rely on counter freezing/unfreezing to enable/disable the PMU.
> - * This is done automatically on PMU ack.
> - * - Ack the PMU only after the APIC.
> - */
> -
> -static int intel_pmu_handle_irq_v4(struct pt_regs *regs)
> -{
> - struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> - int handled = 0;
> - bool bts = false;
> - u64 status;
> - int pmu_enabled = cpuc->enabled;
> - int loops = 0;
> -
> - /* PMU has been disabled because of counter freezing */
> - cpuc->enabled = 0;
> - if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask)) {
> - bts = true;
> - intel_bts_disable_local();
> - handled = intel_pmu_drain_bts_buffer();
> - handled += intel_bts_interrupt();
> - }
> - status = intel_pmu_get_status();
> - if (!status)
> - goto done;
> -again:
> - intel_pmu_lbr_read();
> - if (++loops > 100) {
> - static bool warned;
> -
> - if (!warned) {
> - WARN(1, "perfevents: irq loop stuck!\n");
> - perf_event_print_debug();
> - warned = true;
> - }
> - intel_pmu_reset();
> - goto done;
> - }
> -
> -
> - handled += handle_pmi_common(regs, status);
> -done:
> - /* Ack the PMI in the APIC */
> - apic_write(APIC_LVTPC, APIC_DM_NMI);
> -
> - /*
> - * The counters start counting immediately while ack the status.
> - * Make it as close as possible to IRET. This avoids bogus
> - * freezing on Skylake CPUs.
> - */
> - if (status) {
> - intel_pmu_ack_status(status);
> - } else {
> - /*
> - * CPU may issues two PMIs very close to each other.
> - * When the PMI handler services the first one, the
> - * GLOBAL_STATUS is already updated to reflect both.
> - * When it IRETs, the second PMI is immediately
> - * handled and it sees clear status. At the meantime,
> - * there may be a third PMI, because the freezing bit
> - * isn't set since the ack in first PMI handlers.
> - * Double check if there is more work to be done.
> - */
> - status = intel_pmu_get_status();
> - if (status)
> - goto again;
> - }
> -
> - if (bts)
> - intel_bts_enable_local();
> - cpuc->enabled = pmu_enabled;
> - return handled;
> -}
> -
> /*
> * This handler is triggered by the local APIC, so the APIC IRQ handling
> * rules apply:
> @@ -4081,9 +3980,6 @@ static void intel_pmu_cpu_starting(int cpu)
> if (x86_pmu.version > 1)
> flip_smm_bit(&x86_pmu.attr_freeze_on_smi);
>
> - if (x86_pmu.counter_freezing)
> - enable_counter_freeze();
> -
> /* Disable perf metrics if any added CPU doesn't support it. */
> if (x86_pmu.intel_cap.perf_metrics) {
> union perf_capabilities perf_cap;
> @@ -4154,9 +4050,6 @@ static void free_excl_cntrs(struct cpu_hw_events *cpuc)
> static void intel_pmu_cpu_dying(int cpu)
> {
> fini_debug_store_on_cpu(cpu);
> -
> - if (x86_pmu.counter_freezing)
> - disable_counter_freeze();
> }
>
> void intel_cpuc_finish(struct cpu_hw_events *cpuc)
> @@ -4548,39 +4441,6 @@ static __init void intel_nehalem_quirk(void)
> }
> }
>
> -static const struct x86_cpu_desc counter_freezing_ucodes[] = {
> - INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT, 2, 0x0000000e),
> - INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT, 9, 0x0000002e),
> - INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT, 10, 0x00000008),
> - INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_D, 1, 0x00000028),
> - INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_PLUS, 1, 0x00000028),
> - INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_PLUS, 8, 0x00000006),
> - {}
> -};
> -
> -static bool intel_counter_freezing_broken(void)
> -{
> - return !x86_cpu_has_min_microcode_rev(counter_freezing_ucodes);
> -}
> -
> -static __init void intel_counter_freezing_quirk(void)
> -{
> - /* Check if it's already disabled */
> - if (disable_counter_freezing)
> - return;
> -
> - /*
> - * If the system starts with the wrong ucode, leave the
> - * counter-freezing feature permanently disabled.
> - */
> - if (intel_counter_freezing_broken()) {
> - pr_info("PMU counter freezing disabled due to CPU errata,"
> - "please upgrade microcode\n");
> - x86_pmu.counter_freezing = false;
> - x86_pmu.handle_irq = intel_pmu_handle_irq;
> - }
> -}
> -
> /*
> * enable software workaround for errata:
> * SNB: BJ122
> @@ -4966,9 +4826,6 @@ __init int intel_pmu_init(void)
> max((int)edx.split.num_counters_fixed, assume);
> }
>
> - if (version >= 4)
> - x86_pmu.counter_freezing = !disable_counter_freezing;
> -
> if (boot_cpu_has(X86_FEATURE_PDCM)) {
> u64 capabilities;
>
> @@ -5090,7 +4947,6 @@ __init int intel_pmu_init(void)
>
> case INTEL_FAM6_ATOM_GOLDMONT:
> case INTEL_FAM6_ATOM_GOLDMONT_D:
> - x86_add_quirk(intel_counter_freezing_quirk);
> memcpy(hw_cache_event_ids, glm_hw_cache_event_ids,
> sizeof(hw_cache_event_ids));
> memcpy(hw_cache_extra_regs, glm_hw_cache_extra_regs,
> @@ -5117,7 +4973,6 @@ __init int intel_pmu_init(void)
> break;
>
> case INTEL_FAM6_ATOM_GOLDMONT_PLUS:
> - x86_add_quirk(intel_counter_freezing_quirk);
> memcpy(hw_cache_event_ids, glp_hw_cache_event_ids,
> sizeof(hw_cache_event_ids));
> memcpy(hw_cache_extra_regs, glp_hw_cache_extra_regs,
> @@ -5577,13 +5432,6 @@ __init int intel_pmu_init(void)
> pr_cont("full-width counters, ");
> }
>
> - /*
> - * For arch perfmon 4 use counter freezing to avoid
> - * several MSR accesses in the PMI.
> - */
> - if (x86_pmu.counter_freezing)
> - x86_pmu.handle_irq = intel_pmu_handle_irq_v4;
> -
> if (x86_pmu.intel_cap.perf_metrics)
> x86_pmu.intel_ctrl |= 1ULL << GLOBAL_CTRL_EN_PERF_METRICS;
>
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 10032f023fcc..4084fa31cc21 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -681,8 +681,7 @@ struct x86_pmu {
>
> /* PMI handler bits */
> unsigned int late_ack :1,
> - enabled_ack :1,
> - counter_freezing :1;
> + enabled_ack :1;
> /*
> * sysfs attrs
> */