Re: [BUG] perf/x86/intel: HitM false-positives on Ice Lake / Tiger Lake (I think?)

From: Jann Horn
Date: Fri Feb 23 2024 - 14:38:54 EST


On Fri, Feb 23, 2024 at 4:52 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> On 2024-02-22 3:07 p.m., Jann Horn wrote:
> > On Thu, Feb 22, 2024 at 9:05 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi Jann,
> >>
> >> Sorry for the late response.
> >>
> >> On 2024-02-20 10:42 a.m., Arnaldo Carvalho de Melo wrote:
> >>> Just adding Joe Mario to the CC list.
> >>>
> >>> On Mon, Feb 19, 2024 at 03:20:00PM -0800, Ian Rogers wrote:
> >>>> On Mon, Feb 19, 2024 at 5:01 AM Jann Horn <jannh@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hi!
> >>>>>
> >>>>> From what I understand, "perf c2c" shows bogus HitM events on Ice Lake
> >>>>> (and newer) because Intel added some feature where *clean* cachelines
> >>>>> can get snoop-forwarded ("cross-core FWD"), and the PMU apparently
> >>>>> treats this mostly the same as snoop-forwarding of modified cache
> >>>>> lines (HitM)? On a Tiger Lake CPU, I can see addresses from the kernel
> >>>>> rodata section in "perf c2c report".
> >>>>>
> >>>>> This is mentioned in the SDM, Volume 3B, section "20.9.7 Load Latency
> >>>>> Facility", table "Table 20-101. Data Source Encoding for Memory
> >>>>> Accesses (Ice Lake and Later Microarchitectures)", encoding 07H:
> >>>>> "XCORE FWD. This request was satisfied by a sibling core where either
> >>>>> a modified (cross-core HITM) or a non-modified (cross-core FWD)
> >>>>> cache-line copy was found."
> >>>>>
> >>>>> I don't see anything about this in arch/x86/events/intel/ds.c - if I
> >>>>> understand correctly, the kernel's PEBS data source decoding assumes
> >>>>> that 0x07 means "L3 hit, snoop hitm" on these CPUs. I think this needs
> >>>>> to be adjusted somehow - and maybe it just isn't possible to actually
> >>>>> distinguish between HitM and cross-core FWD in PEBS events on these
> >>>>> CPUs (without big-hammer chicken bit trickery)? Maybe someone from
> >>>>> Intel can clarify?
> >>>>>
> >>>>> (The SDM describes that E-cores on the newer 12th Gen have more
> >>>>> precise PEBS encodings that distinguish between "L3 HITM" and "L3
> >>>>> HITF"; but I guess the P-cores there maybe still don't let you
> >>>>> distinguish HITM/HITF?)
> >>
> >> Right, there is no way to distinguish HITM/HITF on Tiger Lake.
> >
> > Aah, okay, thank you very much for the clarification!
> >
> >> I think what we can do is to add both HITM and HITF for the 0x07 to
> >> match the SDM description.
> >>
> >> How about the below patch (not tested yet)?
> >> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> >> index d49d661ec0a7..8c966b5b23cb 100644
> >> --- a/arch/x86/events/intel/ds.c
> >> +++ b/arch/x86/events/intel/ds.c
> >> @@ -84,7 +84,7 @@ static u64 pebs_data_source[] = {
> >> OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, NONE), /* 0x04: L3 hit */
> >> OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, MISS), /* 0x05: L3 hit, snoop miss */
> >> OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HIT), /* 0x06: L3 hit, snoop hit */
> >> - OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM), /* 0x07: L3 hit, snoop hitm */
> >> + OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM) | P(SNOOPX, FWD), /* 0x07: L3 hit, snoop hitm & fwd */
> >> OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HIT), /* 0x08: L3 miss snoop hit */
> >> OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HITM), /* 0x09: L3 miss snoop hitm*/
> >> OP_LH | P(LVL, LOC_RAM) | LEVEL(RAM) | P(SNOOP, HIT), /* 0x0a: L3 miss, shared */
> >
> > (I'm not familiar enough with the perf semantics to know how the event
> > encoding works, maybe someone else can have a look?)
> >
>
> I can do the test to verify the settings and perf c2c. But I don't have
> a benchmark. Could you please share your benchmark with me?
> For example, the data you used in your example.
> # perf record -e mem_load_l3_hit_retired.xsnp_fwd:ppp --all-kernel -c 100 --data

It seems to happen at a low rate in the background when I'm just
clicking around on websites; compiling the kernel with "make -j8"
(where 8 is the number of hyperthreads my Tiger Lake laptop has) seems
to make it happen at a somewhat higher rate, a few times per second.

Sorry, I don't really have a particularly good microbenchmark or such
that makes this happen at an abnormally high rate...
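
For whoever ends up looking at the consumer side: if 0x07 is reported with
both HITM and FWD set, as in the patch quoted above, a tool could treat
that combination as "HITM or clean forward, can't tell" instead of counting
it as a plain HitM. Below is a rough, untested sketch of that check. The
SK_* names and sk_classify() are made up for illustration, and the shift and
flag values are my assumption of what include/uapi/linux/perf_event.h
defines for the mem_snoop and mem_snoopx fields; real code should use the
PERF_MEM_* definitions from that header instead.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical local stand-ins for the snoop fields of perf_mem_data_src.
 * The values are assumed to mirror the uapi header; do not rely on them.
 */
#define SK_SNOOP_SHIFT   19       /* assumed start of mem_snoop */
#define SK_SNOOP_HITM    0x10ULL  /* assumed HITM flag within mem_snoop */
#define SK_SNOOPX_SHIFT  38       /* assumed start of mem_snoopx */
#define SK_SNOOPX_FWD    0x01ULL  /* assumed FWD flag within mem_snoopx */

enum sk_hitm_class {
	SK_NOT_HITM,     /* HITM flag not set */
	SK_HITM,         /* HITM set, FWD not set: modified line */
	SK_HITM_OR_FWD,  /* both set: modified or clean forward, ambiguous */
};

/* Classify a raw data_src value by its HITM and FWD snoop flags. */
static enum sk_hitm_class sk_classify(uint64_t data_src)
{
	bool hitm = ((data_src >> SK_SNOOP_SHIFT) & SK_SNOOP_HITM) != 0;
	bool fwd = ((data_src >> SK_SNOOPX_SHIFT) & SK_SNOOPX_FWD) != 0;

	if (!hitm)
		return SK_NOT_HITM;
	return fwd ? SK_HITM_OR_FWD : SK_HITM;
}
```

With something like this, samples classified as SK_HITM_OR_FWD could be
shown in a separate column or excluded from HitM totals, so that read-only
data such as kernel rodata stops looking like false sharing. Whether perf
c2c should do that is a question for the people who know the tool.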