v2 of comments on Performance Counters for Linux (PCL)

From: stephane eranian
Date: Tue Jun 16 2009 - 13:42:51 EST


Hi,

Here is an updated version of my comments on PCL. Compared to the
previous version, I have removed all the issues that were fixed or
clarified. I have kept the issues and open questions which I think are
not yet resolved, and I have added a few more.


I/ General API comments

 1/ System calls

    * ioctl()

    You have defined 5 ioctls so far to operate on an existing event.
    I was under the impression that ioctl() should not be used except
    for drivers.

    How do you justify your usage of ioctl() in this context?
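
    For reference, my reading of the current interface is that a tool
    drives an event entirely through these ioctls; the names below are
    taken from the patches as I understand them, so treat them as
    assumptions:

    #include <sys/ioctl.h>

    /* the five ioctls currently defined on a counter fd (as I read
     * the patches): ENABLE, DISABLE, REFRESH, RESET, PERIOD */
    void run_measurement(int fd)
    {
            ioctl(fd, PERF_COUNTER_IOC_RESET,   0); /* zero the count */
            ioctl(fd, PERF_COUNTER_IOC_ENABLE,  0); /* start counting */
            /* ... workload ... */
            ioctl(fd, PERF_COUNTER_IOC_DISABLE, 0); /* stop counting  */
    }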

 2/ Grouping

    By design, an event can only be part of one group at a time. Events
    in a group are guaranteed to be active on the PMU at the same time.
    That means a group cannot have more events than there are available
    counters on the PMU. Tools may want to know the number of counters
    available in order to group their events accordingly, such that
    reliable ratios can be computed. It seems the only way to know this
    is by trial and error. This is not practical.

 3/ Multiplexing and system-wide

    Multiplexing is time-based and is hooked into the timer tick. At
    every tick, the kernel tries to schedule another set of event groups.

    In tickless kernels, if a CPU is idle, no timer tick is generated,
    and therefore no multiplexing occurs. This is incorrect. It is not
    because the CPU is idle that there aren't any interesting PMU events
    to measure: parts of the CPU may still be active, e.g., caches and
    buses. Thus, it is expected that multiplexing still happens.

    You need to hook the timer source for multiplexing up to something
    else which is not affected by tickless operation. You cannot simply
    disable tickless during a measurement because you would not be
    measuring the system as it actually behaves.

 4/ Controlling group multiplexing

    Although multiplexing is exposed to users via the timing information,
    events may not necessarily be grouped at random by tools. Groups may
    not be ordered at random either.

    I know of tools which craft the sequence of groups carefully, placing
    related events in neighboring groups so that they measure similar
    parts of the execution. This way, you can mitigate the fluctuations
    introduced by multiplexing. In other words, some tools may want to
    control the order in which groups are scheduled on the PMU.

    You mentioned that groups are multiplexed in creation order. But
    which creation order? As far as I know, multiple distinct tools may
    be attaching to the same thread at the same time, and their groups
    may be interleaved in the list. Therefore, I believe 'creation order'
    refers to the global group creation order, which is only visible to
    the kernel. Each tool may see a different order. Let's take an
    example.

    Tool A creates groups G1, G2, G3 and attaches them to thread T0. At
    the same time, tool B creates groups G4, G5. The actual global order
    may be: G1, G4, G2, G5, G3. This is what the kernel is going to
    multiplex. Each group will be multiplexed in the right order from the
    point of view of each tool, but there will be gaps. It would be nice
    to have a way to ensure that the sequence is either G1, G2, G3, G4,
    G5 or G4, G5, G1, G2, G3. In other words, avoid the interleaving.

 5/ Mmaped count

    It is possible to read counts directly from user space for
    self-monitoring threads. This leverages a HW capability present on
    some processors. On X86, this is possible via RDPMC.

    The full 64-bit count is constructed by combining the hardware value
    extracted with an assembly instruction and a base value made
    available through the mmap. There is an atomic generation count
    available to deal with the race condition.
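
    As a sketch, the user-level read sequence would look something like
    this on x86. The field names below follow my reading of the mmap'ed
    page structure in the patches and should be treated as assumptions:

    #include <stdint.h>

    /* simplified view of the mmap'ed metadata page (assumed layout) */
    struct counter_page_sketch {
            uint32_t lock;   /* generation count, bumped around updates */
            uint32_t index;  /* HW counter index + 1, 0 if off the PMU  */
            int64_t  offset; /* base value to add to the HW count       */
    };

    static inline uint64_t rdpmc(uint32_t idx)
    {
            uint32_t lo, hi;
            asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (idx));
            return (uint64_t)hi << 32 | lo;
    }

    uint64_t read_self_count(volatile struct counter_page_sketch *pc)
    {
            uint32_t seq;
            uint64_t count;

            do {
                    seq = pc->lock;            /* snapshot generation  */
                    __sync_synchronize();
                    count = pc->offset;
                    if (pc->index)             /* event currently live */
                            count += rdpmc(pc->index - 1);
                    __sync_synchronize();
            } while (pc->lock != seq);         /* retry if it changed  */

            return count;
    }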

    I believe there is a problem with this approach given that the PMU
    is shared and that events can be multiplexed. That means that even
    though you are self-monitoring, events get replaced on the PMU. The
    assembly instruction is unaware of that: it reads a register, not an
    event.

    On x86, assume event A is hosted in counter 0; you then need RDPMC(0)
    to extract the count. But then the event is replaced by another one
    which reuses counter 0. At the user level, you will still use
    RDPMC(0), but it will read the HW value from a different event and
    combine it with a base count from another one.

    To avoid this, you need to pin the event so it stays on the PMU at
    all times. Now, here is something unclear to me. Pinning does not
    mean staying in the SAME register; it means the event stays on the
    PMU, but it can possibly change registers. To prevent that, I believe
    you also need to set exclusive, so that no other group can be
    scheduled and thus possibly use the same counter.

    This looks like the only way you can actually make this work.
    Not setting pinned+exclusive is another pitfall that many people
    will fall into.
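
    In other words, a self-monitoring setup would presumably have to
    request both attributes at creation time. A minimal sketch, assuming
    the attribute names used in the patches:

    struct perf_counter_attr attr = {
            .type      = PERF_TYPE_HARDWARE,
            .config    = PERF_COUNT_INSTRUCTIONS, /* any HW event      */
            .pinned    = 1, /* keep the event on the PMU at all times  */
            .exclusive = 1, /* keep other groups off, so the counter   */
                            /* index seen via the mmap page is stable  */
    };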

 6/ Group scheduling

    Looking at the existing code, it seems to me there is a risk of
    starvation for groups, i.e., groups never scheduled on the PMU.

    My understanding of the scheduling algorithm is:

        - first, try to schedule pinned groups. If a pinned group
          fails, put it in error mode. read() will fail until the
          group gets another chance at being scheduled.

        - then, try to schedule the remaining groups. If a group
          fails, just skip it.

    If the group list never changed, certain groups could always fail.
    However, the ordering of the list changes because at every tick it
    is rotated: the head becomes the tail. Therefore, each group
    eventually reaches the first position and gets the full PMU to
    assign its events.

    This works as long as there is a guarantee the list will ALWAYS
    rotate. If a thread does not run long enough for a tick, it may
    never rotate, as the sketch below illustrates.
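
    Here is a simplified sketch of my mental model of the tick-side
    logic; this is not the actual kernel code, and the context structure
    and helper names are made up:

    /* hypothetical: invoked from the timer tick for the current context */
    void rotate_counter_groups(struct counter_context *ctx)
    {
            /* move the head group to the tail so that, over successive
             * ticks, every group eventually reaches the front and gets
             * first pick of the counters */
            list_rotate_left(&ctx->group_list);
            reschedule_groups(ctx);
    }

    /* If the monitored thread never runs for a full tick, this function
     * is never invoked, the list never rotates, and the same groups keep
     * failing to schedule. */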

 7/ Group validity checking

    At the user level, an application is only concerned with events and
    the grouping of those events. The assignment logic is performed by
    the kernel.

    For a group to be scheduled, all its events must be compatible with
    each other; otherwise the group will never be scheduled. It is not
    clear to me when that sanity check is performed if I create the
    group such that it is stopped.

    If the invalid group makes it all the way to scheduling, it will
    never be scheduled: counts will be zero and users will have no idea
    why. If the group is put in error state, read() will not be
    possible. But again, how will the user know why?

 8/ Generalized cache events

    In recent days, you have added support for what you call
    'generalized cache events'.

    The changelog defines:
        new event type: PERF_TYPE_HW_CACHE

        This is a 3-dimensional space:
        { L1-D, L1-I, L2, ITLB, DTLB, BPU } x
        { load, store, prefetch } x
        { accesses, misses }

    Those generic events are then mapped by the kernel onto actual
    PMU events, if possible.
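
    As an illustration of the three dimensions, selecting one of these
    events presumably packs one index from each axis into the event
    config, something like this (the constant names and the shift layout
    are my reading of the patch and may be off):

    struct perf_counter_attr attr = {
            .type   = PERF_TYPE_HW_CACHE,
            /* hypothetical encoding of "L1-D load misses":
             * config = cache_id | (op_id << 8) | (result_id << 16) */
            .config = PERF_COUNT_HW_CACHE_L1D
                    | (PERF_COUNT_HW_CACHE_OP_READ     <<  8)
                    | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16),
    };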

    I don't see any justification for adding this, especially in the
    kernel.

    What is the motivation and goal of this?

    If you define generic events, you need to provide a clear definition
    of what they are actually measuring. This is especially true for
    caches, because there are many cache events and many different
    behaviors.

    If the goal is to make comparisons easier, I believe this is doomed
    to fail, because different caches behave differently and events
    capture different subtle things, e.g., HW prefetch vs. SW prefetch.
    If, to actually understand what the generic event is counting, I
    need to know the mapping, then this whole feature is useless.

 9/ Group reading

    It is possible to start/stop an event group simply via ioctl() on
    the group leader. However, it is not possible to read all the counts
    with a single read() system call. That seems odd. Furthermore, I
    believe you want reads to be as atomic as possible.
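
    To make the issue concrete: with one file descriptor per event, a
    tool currently has to do something like the sketch below, and the
    group can be rescheduled between the individual read() calls, so the
    counts are not a consistent snapshot:

    #include <stdint.h>
    #include <unistd.h>

    /* read a group of n events one fd at a time; each read() samples
     * the group at a slightly different point in time */
    void read_group(int fds[], uint64_t counts[], int n)
    {
            for (int i = 0; i < n; i++)
                    read(fds[i], &counts[i], sizeof(counts[i]));
    }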

 10/ Event buffer minimal useful size

    As it stands, the buffer header occupies the first page, even though
    the buffer header struct is 32 bytes long. That's a lot of precious
    RLIMIT_MEMLOCK memory wasted.

    The actual buffer (data) starts at the next page (from builtin-top.c):

    static void mmap_read_counter(struct mmap_data *md)
    {
            unsigned int head = mmap_read_head(md);
            unsigned int old = md->prev;
            unsigned char *data = md->base + page_size;
            /* ... */
    }

    Given that the buffer "full" notifications are sent on page-crossing
    boundaries, if the actual buffer payload size is 1 page, you are
    guaranteed to have your samples overwritten.

    This leads me to believe that the minimal buffer size to get useful
    data is 3 pages. This is per event group per thread. That puts a lot
    of pressure on RLIMIT_MEMLOCK, which is usually set fairly low by
    distros.
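
    For scale: if, as I believe, the mapping must be one header page
    plus a power-of-two number of data pages, the smallest useful
    mapping per event group would be along these lines (a sketch; the
    1+2^n constraint is my understanding of the current code):

    #include <sys/mman.h>
    #include <unistd.h>

    void *map_counter_buffer(int fd)
    {
            long   page_size = sysconf(_SC_PAGESIZE);
            /* 1 header page + 2 data pages: per the reasoning above,
             * the practical minimum to avoid overwritten samples */
            size_t len = (1 + 2) * page_size;

            return mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    }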

 11/ Missing definitions for generic hardware events

    As soon as you define generic events, you need to provide a clear
    and precise definition as to what they measure. This is crucial to
    making them useful. I have not seen such a definition yet.

II/ X86 comments

 1/ Fixed counters on Intel

    You cannot simply fall back to generic counters if you cannot find
    a fixed counter. There are model-specific bugs: for instance,
    UNHALTED_REFERENCE_CYCLES (0x013c) does not measure the same thing
    on Nehalem when it is used in fixed counter 2 as in a generic
    counter. The same is true on Core.

    You cannot simply look at the event field code to determine whether
    the event is supported by a fixed counter. You must look at the
    other fields, such as edge, invert, and cnt-mask. If those are
    present, then you have to fall back to using a generic counter, as
    fixed counters only support priv-level filtering. As indicated
    above, though, programming UNHALTED_REFERENCE_CYCLES on a generic
    counter does not count the same thing, so for this event you need
    to fail if filters other than priv levels are present.
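
    Schematically, the fixed-counter decision has to look beyond the
    event code. A sketch using the usual x86 PERFEVTSEL bit positions
    (treat the exact masks as assumptions):

    #include <stdint.h>

    #define EVTSEL_EDGE   (1ULL   << 18) /* edge detect          */
    #define EVTSEL_INV    (1ULL   << 23) /* invert counter mask  */
    #define EVTSEL_CMASK  (0xffULL << 24) /* counter mask        */

    /* a fixed counter is usable only if no filters beyond priv
     * levels are requested */
    int fits_fixed_counter(uint64_t evtsel)
    {
            return !(evtsel & (EVTSEL_EDGE | EVTSEL_INV | EVTSEL_CMASK));
    }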

 2/ Event knowledge missing

    There are constraints on events in Intel processors. Different
    constraints also exist on AMD64 processors, especially with
    uncore-related events.

    In your model, those need to be taken care of by the kernel. Should
    the kernel make the wrong decision, there would be no workaround for
    user tools. Take the example I outlined just above with Intel fixed
    counters.

    The current code base does not have any constrained-event support,
    therefore bogus counts may be returned depending on the event
    measured.

III/ Requests

 1/ Sampling period randomization

    It is our experience (on Itanium, for instance) that for certain
    sampling measurements, it is beneficial to randomize the sampling
    period a bit. This is in particular the case when sampling on an
    event that happens very frequently and which is not related to
    timing, e.g., branch_instructions_retired. Randomization helps
    mitigate the bias. You do not need anything sophisticated. But when
    you are using a kernel-level sampling buffer, you need to have the
    kernel do the randomization. Randomization needs to be supported
    per event.
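
    Nothing sophisticated is needed; something along these lines,
    applied by the kernel each time it re-arms the period, would do (a
    sketch; the mask would be a per-event parameter):

    #include <stdint.h>
    #include <stdlib.h>

    /* perturb the sampling period within roughly +/- mask/2; random()
     * stands in for whatever cheap in-kernel PRNG is available */
    uint64_t next_period(uint64_t base_period, uint64_t mask)
    {
            int64_t delta = (int64_t)(random() & mask)
                          - (int64_t)(mask >> 1);

            return base_period + delta;
    }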

IV/ Open questions

 1/ Support for model-specific uncore PMU monitoring capabilities

    Recent processors have multiple PMUs: typically one per core, but
    also one at the socket level, e.g., on Intel Nehalem. It is expected
    that this API will provide access to these PMUs as well.

    It seems that with the current API, raw events for those PMUs would
    need a new architecture-specific type, as the event encoding by
    itself may not be enough to disambiguate between a core and an
    uncore PMU event.

    How are those events going to be supported?
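
    For instance, a new attr.type value would presumably be needed,
    since the raw event code alone is ambiguous (the PERF_TYPE_UNCORE_RAW
    name below is entirely hypothetical):

    struct perf_counter_attr attr = {
            /* hypothetical: the same raw code can mean different things
             * on the core and uncore PMUs, so the type must carry the
             * distinction */
            .type   = PERF_TYPE_UNCORE_RAW, /* made-up, not in the API  */
            .config = raw_uncore_code,      /* placeholder for a model- */
                                            /* specific event encoding  */
    };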

 2/ Features impacting all counters

    On some PMU models, e.g., Itanium, there are certain features which
    influence all active counters. For instance, there is a way to
    restrict monitoring to a range of contiguous code or data addresses
    using both some PMU registers and the debug registers.

    Given that the API exposes events (counters) as independent of each
    other, I wonder how range restriction could be implemented.

    Similarly, on Itanium, there are global behaviors. For instance, on
    counter overflow the entire PMU freezes all at once. That seems
    contradictory to the design of the API, which creates the illusion
    of independence.

    What solutions do you propose?

 3/ AMD IBS

    How is AMD IBS going to be implemented?

    IBS has two separate sets of registers: one to capture fetch-related
    data and another one to capture instruction execution data. For
    each, there is one config register but multiple data registers. In
    each mode, there is a specific sampling period and IBS can
    interrupt.

    It looks like you could define two pseudo-events or event types and
    then define a new record_format and read_format. Those formats would
    only be valid for an IBS event.

    Is that how you intend to support IBS?

 4/ Intel PEBS

    Since Netburst-based processors, Intel PMUs have supported a
    hardware sampling buffer mechanism called PEBS.

    PEBS only really became useful with Nehalem.

    Not all events support PEBS. Up until Nehalem, only one counter
    supported PEBS (PMC0). The format of the hardware buffer has changed
    between Core and Nehalem. It is not yet architected, thus it can
    still evolve with future PMU models.

    On Nehalem, there is a new PEBS-based feature called Load Latency
    Filtering which captures where data cache misses occur (similar to
    Itanium D-EAR). Activating this feature requires setting a latency
    threshold hosted in a separate PMU MSR.

    On Nehalem, given that all four generic counters support PEBS, the
    sampling buffer may contain samples generated by any of the four
    counters. The buffer includes a bitmask of registers to determine
    the source of the samples. Multiple bits may be set in the bitmask.

    How will PEBS be supported in this new API?

 5/ Intel Last Branch Record (LBR)

    Intel processors since Netburst have had a cyclic buffer hosted in
    registers which can record taken branches. Each taken branch is
    stored in a pair of LBR registers (source, destination). Up until
    Nehalem, there were no filtering capabilities for LBR. LBR is not
    an architected PMU feature.

    There is no counter associated with LBR. Nehalem has an LBR_SELECT
    MSR. However, there are some constraints on it given that it is
    shared between hardware threads.

    LBR is only useful when sampling and therefore must be combined with
    a counter. LBR must also be configured to freeze on PMU interrupt.

    How is LBR going to be supported?