Re: IV.4 - Intel PEBS

From: Ingo Molnar
Date: Mon Jun 22 2009 - 08:01:32 EST


> 4/ Intel PEBS
>
> Since Netburst-based processors, Intel PMUs support a hardware
> sampling buffer mechanism called PEBS.
>
> PEBS really became useful with Nehalem.
>
> Not all events support PEBS. Up until Nehalem, only one counter
> supported PEBS (PMC0). The format of the hardware buffer has
> changed between Core and Nehalem. It is not yet architected, thus
> it can still evolve with future PMU models.
>
> On Nehalem, there is a new PEBS-based feature called Load Latency
> Filtering which captures where data cache misses occur (similar to
> Itanium D-EAR). Activating this feature requires setting a latency
> threshold hosted in a separate PMU MSR.
>
> On Nehalem, given that all 4 generic counters support PEBS, the
> sampling buffer may contain samples generated by any of the 4
> counters. The buffer includes a bitmask of registers to determine
> the source of the samples. Multiple bits may be set in the
> bitmask.
>
> How PEBS will be supported for this new API?

Note, the relevance of PEBS (or IBS) should not be overstated: for
example it fundamentally cannot do precise call-chain recording (it
only records the RIP, not any of the return frames), which detracts
from its utility. Another limitation is that only a few basic
hardware event types are supported by PEBS.

Having said that, PEBS is a hardware sampling feature that is
definitely saner than AMD's IBS. There are two immediate incremental
uses of it in perfcounters:

- it makes flat sampling lower overhead by avoiding an NMI at every
  sample point.

- it makes flat sampled data more precise. (I.e. it can avoid the
  1-2 instruction 'skid' of a sample position, for a handful of
  PEBS-capable events.)

As such its primary support form would be 'transparent enablement':
i.e. on those (relatively few) events that are PEBS-capable it
would be enabled automatically, and would result in more precise
(and possibly cheaper) samples.

No separate APIs are needed really - the kernel can abstract it away
and can provide the user what the user wants: good and fast samples.

Regarding demultiplexing on Nehalem: PEBS goes into the DS (Debug
Store), and indeed on Nehalem all PEBS counters 'mix' their PEBS
records in the same stream of data. One possible model to support
them is to set the PEBS threshold to one, and hence generate an
interrupt for each PEBS record. At offset 0x90 of the PEBS record we
have a snapshot of the global status register:

0x90 IA32_PERF_GLOBAL_STATUS

This tells us which counter(s) overflowed relative to the previous
PEBS record in the DS. If this were not reliable, we could still
poll all active counters for overflows and get an occasionally
imprecise but still statistically meaningful demultiplexing.
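
A minimal sketch of that demultiplexing in C, assuming the Nehalem
record layout is two words (flags, IP), sixteen general-purpose
registers, then the status word at 0x90, followed by three load
latency words. The struct and function names are made up for
illustration and are not existing kernel interfaces:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_GENERIC_COUNTERS 4	/* Nehalem: PMC0..PMC3 */

/* Hypothetical mirror of one PEBS record (layout is an assumption). */
struct pebs_record_sketch {
	uint64_t flags, ip;
	uint64_t gpr[16];	/* ax .. r15 */
	uint64_t status;	/* IA32_PERF_GLOBAL_STATUS snapshot */
	uint64_t dla, dse, lat;	/* load latency fields */
};

static_assert(offsetof(struct pebs_record_sketch, status) == 0x90,
	      "status snapshot expected at offset 0x90");

/*
 * A threshold of "one" means: interrupt as soon as the first record
 * has been written, i.e. buffer base plus one record size.
 */
static uint64_t pebs_threshold_one(uint64_t buffer_base)
{
	return buffer_base + sizeof(struct pebs_record_sketch);
}

/*
 * Walk 'nr' records and credit each one to every generic counter
 * whose bit is set in the status snapshot. Several bits may be set
 * in one record, in which case the sample is credited to all of them.
 */
static void pebs_demux(const struct pebs_record_sketch *rec, size_t nr,
		       uint64_t samples[NR_GENERIC_COUNTERS])
{
	const uint64_t valid = (1ULL << NR_GENERIC_COUNTERS) - 1;

	for (size_t i = 0; i < nr; i++) {
		uint64_t mask = rec[i].status & valid;

		for (unsigned int c = 0; c < NR_GENERIC_COUNTERS; c++)
			if (mask & (1ULL << c))
				samples[c]++;
	}
}
```

With a threshold of one record, 'nr' would normally be 1 per
interrupt; the loop only matters if records pile up before the
handler runs.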

As to enabling PEBS with the (CPU-)global latency recording filters,
we can do this transparently for every PEBS-capable event, or can
mandate PEBS scheduling when a PEBS-only feature like load latency
is requested.
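
One way that policy could look, as a rough sketch only. The types
and the function below are hypothetical and do not correspond to
any existing perfcounters code; they just spell out the decision
described above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sample_mode {
	SAMPLE_NMI,		/* classic counter overflow NMI */
	SAMPLE_PEBS,		/* hardware-buffered precise samples */
	SAMPLE_UNSUPPORTED,	/* request cannot be satisfied */
};

/* Hypothetical view of what the user asked for. */
struct event_request_sketch {
	bool pebs_capable;	 /* hardware can sample this event via PEBS */
	bool wants_load_latency; /* PEBS-only feature was requested */
};

static enum sample_mode
choose_sample_mode(const struct event_request_sketch *req)
{
	assert(req != NULL);

	/* A PEBS-only feature mandates PEBS; no NMI fallback exists. */
	if (req->wants_load_latency)
		return req->pebs_capable ? SAMPLE_PEBS : SAMPLE_UNSUPPORTED;

	/* Otherwise PEBS is used transparently whenever it is available. */
	return req->pebs_capable ? SAMPLE_PEBS : SAMPLE_NMI;
}
```

The user never asks for PEBS by name here: it is either picked up
silently, or implied by requesting a feature that only PEBS provides.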

This means that for most purposes PEBS will be transparent.