v2 of comments on Performance Counters for Linux (PCL)

From: stephane eranian
Date: Tue Jun 16 2009 - 13:42:51 EST


Hi,

Here is an updated version of my comments on PCL. Compared to the
previous version, I have removed all the issues that were fixed or
clarified. I have kept the issues and open questions which I think are
not yet resolved, and I have added a few more.


I/ General API comments

 1/ System calls

    * ioctl()

    You have defined 5 ioctls so far to operate on an existing event.
    I was under the impression that ioctl() should not be used except
    for drivers.

    How do you justify your usage of ioctl() in this context?
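
    For reference, my reading of the current interface is that a tool
    drives an event entirely through these ioctls; the names below are
    taken from the patches as I understand them, so treat them as
    assumptions:

    #include <sys/ioctl.h>

    /* the five ioctls currently defined on a counter fd (as I read
     * the patches): ENABLE, DISABLE, REFRESH, RESET, PERIOD */
    void run_measurement(int fd)
    {
            ioctl(fd, PERF_COUNTER_IOC_RESET,   0); /* zero the count */
            ioctl(fd, PERF_COUNTER_IOC_ENABLE,  0); /* start counting */
            /* ... workload ... */
            ioctl(fd, PERF_COUNTER_IOC_DISABLE, 0); /* stop counting  */
    }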

 2/ Grouping

    By design, an event can only be part of one group at a time. Events
    in a group are guaranteed to be active on the PMU at the same time.
    That means a group cannot have more events than there are available
    counters on the PMU. Tools may want to know the number of counters
    available in order to group their events accordingly, such that
    reliable ratios can be computed. It seems the only way to know this
    is by trial and error. This is not practical.

 3/ Multiplexing and system-wide

    Multiplexing is time-based and is hooked into the timer tick. At
    every tick, the kernel tries to schedule another set of event groups.

    In tickless kernels, if a CPU is idle, no timer tick is generated,
    and therefore no multiplexing occurs. This is incorrect. It is not
    because the CPU is idle that there aren't any interesting PMU events
    to measure: parts of the CPU may still be active, e.g., caches and
    buses. Thus, it is expected that multiplexing still happens.

    You need to hook the timer source for multiplexing up to something
    else which is not affected by tickless operation. You cannot simply
    disable tickless during a measurement because you would not be
    measuring the system as it actually behaves.

 4/ Controlling group multiplexing

    Although multiplexing is exposed to users via the timing information,
    events may not necessarily be grouped at random by tools. Groups may
    not be ordered at random either.

    I know of tools which craft the sequence of groups carefully, placing
    related events in neighboring groups so that they measure similar
    parts of the execution. This way, you can mitigate the fluctuations
    introduced by multiplexing. In other words, some tools may want to
    control the order in which groups are scheduled on the PMU.

    You mentioned that groups are multiplexed in creation order. But
    which creation order? As far as I know, multiple distinct tools may
    be attaching to the same thread at the same time, and their groups
    may be interleaved in the list. Therefore, I believe 'creation order'
    refers to the global group creation order, which is only visible to
    the kernel. Each tool may see a different order. Let's take an
    example.

    Tool A creates groups G1, G2, G3 and attaches them to thread T0. At
    the same time, tool B creates groups G4, G5. The actual global order
    may be: G1, G4, G2, G5, G3. This is what the kernel is going to
    multiplex. Each group will be multiplexed in the right order from the
    point of view of each tool, but there will be gaps. It would be nice
    to have a way to ensure that the sequence is either G1, G2, G3, G4,
    G5 or G4, G5, G1, G2, G3. In other words, avoid the interleaving.

 5/ Mmaped count

    It is possible to read counts directly from user space for
    self-monitoring threads. This leverages a HW capability present on
    some processors. On X86, this is possible via RDPMC.

    The full 64-bit count is constructed by combining the hardware value
    extracted with an assembly instruction and a base value made
    available through the mmap. There is an atomic generation count
    available to deal with the race condition.
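
    As a sketch, the user-level read sequence would look something like
    this on x86. The field names below follow my reading of the mmap'ed
    page structure in the patches and should be treated as assumptions:

    #include <stdint.h>

    /* simplified view of the mmap'ed metadata page (assumed layout) */
    struct counter_page_sketch {
            uint32_t lock;   /* generation count, bumped around updates */
            uint32_t index;  /* HW counter index + 1, 0 if off the PMU  */
            int64_t  offset; /* base value to add to the HW count       */
    };

    static inline uint64_t rdpmc(uint32_t idx)
    {
            uint32_t lo, hi;
            asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (idx));
            return (uint64_t)hi << 32 | lo;
    }

    uint64_t read_self_count(volatile struct counter_page_sketch *pc)
    {
            uint32_t seq;
            uint64_t count;

            do {
                    seq = pc->lock;            /* snapshot generation  */
                    __sync_synchronize();
                    count = pc->offset;
                    if (pc->index)             /* event currently live */
                            count += rdpmc(pc->index - 1);
                    __sync_synchronize();
            } while (pc->lock != seq);         /* retry if it changed  */

            return count;
    }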

    I believe there is a problem with this approach given that the PMU
    is shared and that events can be multiplexed. That means that even
    though you are self-monitoring, events get replaced on the PMU. The
    assembly instruction is unaware of that: it reads a register, not an
    event.

    On x86, assume event A is hosted in counter 0; you then need RDPMC(0)
    to extract the count. But then the event is replaced by another one
    which reuses counter 0. At the user level, you will still use
    RDPMC(0), but it will read the HW value from a different event and
    combine it with a base count from another one.

    To avoid this, you need to pin the event so it stays on the PMU at
    all times. Now, here is something unclear to me. Pinning does not
    mean staying in the SAME register; it means the event stays on the
    PMU, but it can possibly change registers. To prevent that, I believe
    you also need to set exclusive, so that no other group can be
    scheduled and thus possibly use the same counter.

    This looks like the only way you can actually make this work.
    Not setting pinned+exclusive is another pitfall that many people
    will fall into.
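
    In other words, a self-monitoring setup would presumably have to
    request both attributes at creation time. A minimal sketch, assuming
    the attribute names used in the patches:

    struct perf_counter_attr attr = {
            .type      = PERF_TYPE_HARDWARE,
            .config    = PERF_COUNT_INSTRUCTIONS, /* any HW event      */
            .pinned    = 1, /* keep the event on the PMU at all times  */
            .exclusive = 1, /* keep other groups off, so the counter   */
                            /* index seen via the mmap page is stable  */
    };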

 6/ Group scheduling

    Looking at the existing code, it seems to me there is a risk of
    starvation for groups, i.e., groups never scheduled on the PMU.

    My understanding of the scheduling algorithm is:

        - first, try to schedule pinned groups. If a pinned group
          fails, put it in error mode. read() will fail until the
          group gets another chance at being scheduled.

        - then, try to schedule the remaining groups. If a group
          fails, just skip it.

    If the group list never changed, certain groups could always fail.
    However, the ordering of the list changes because at every tick it
    is rotated: the head becomes the tail. Therefore, each group
    eventually reaches the first position and gets the full PMU to
    assign its events.

    This works as long as there is a guarantee the list will ALWAYS
    rotate. If a thread does not run long enough for a tick, it may
    never rotate, as the sketch below illustrates.
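
    Here is a simplified sketch of my mental model of the tick-side
    logic; this is not the actual kernel code, and the context structure
    and helper names are made up:

    /* hypothetical: invoked from the timer tick for the current context */
    void rotate_counter_groups(struct counter_context *ctx)
    {
            /* move the head group to the tail so that, over successive
             * ticks, every group eventually reaches the front and gets
             * first pick of the counters */
            list_rotate_left(&ctx->group_list);
            reschedule_groups(ctx);
    }

    /* If the monitored thread never runs for a full tick, this function
     * is never invoked, the list never rotates, and the same groups keep
     * failing to schedule. */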

 7/ Group validity checking

    At the user level, an application is only concerned with events and
    the grouping of those events. The assignment logic is performed by
    the kernel.

    For a group to be scheduled, all its events must be compatible with
    each other; otherwise the group will never be scheduled. It is not
    clear to me when that sanity check is performed if I create the
    group such that it is stopped.

    If the invalid group makes it all the way to scheduling, it will
    never be scheduled: counts will be zero and users will have no idea
    why. If the group is put in error state, read() will not be
    possible. But again, how will the user know why?

 8/ Generalized cache events

    In recent days, you have added support for what you call
    'generalized cache events'.

    The changelog defines:
        new event type: PERF_TYPE_HW_CACHE

        This is a 3-dimensional space:
        { L1-D, L1-I, L2, ITLB, DTLB, BPU } x
        { load, store, prefetch } x
        { accesses, misses }

    Those generic events are then mapped by the kernel onto actual
    PMU events, if possible.
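
    As an illustration of the three dimensions, selecting one of these
    events presumably packs one index from each axis into the event
    config, something like this (the constant names and the shift layout
    are my reading of the patch and may be off):

    struct perf_counter_attr attr = {
            .type   = PERF_TYPE_HW_CACHE,
            /* hypothetical encoding of "L1-D load misses":
             * config = cache_id | (op_id << 8) | (result_id << 16) */
            .config = PERF_COUNT_HW_CACHE_L1D
                    | (PERF_COUNT_HW_CACHE_OP_READ     <<  8)
                    | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16),
    };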

    I don't see any justification for adding this, especially in the
    kernel.

    What is the motivation and goal of this?

    If you define generic events, you need to provide a clear definition
    of what they are actually measuring. This is especially true for
    caches, because there are many cache events and many different
    behaviors.

    If the goal is to make comparisons easier, I believe this is doomed
    to fail, because different caches behave differently and events
    capture different subtle things, e.g., HW prefetch vs. SW prefetch.
    If, to actually understand what the generic event is counting, I
    need to know the mapping, then this whole feature is useless.

 9/ Group reading

    It is possible to start/stop an event group simply via ioctl() on
    the group leader. However, it is not possible to read all the counts
    with a single read() system call. That seems odd. Furthermore, I
    believe you want reads to be as atomic as possible.
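
    To make the issue concrete: with one file descriptor per event, a
    tool currently has to do something like the sketch below, and the
    group can be rescheduled between the individual read() calls, so the
    counts are not a consistent snapshot:

    #include <stdint.h>
    #include <unistd.h>

    /* read a group of n events one fd at a time; each read() samples
     * the group at a slightly different point in time */
    void read_group(int fds[], uint64_t counts[], int n)
    {
            for (int i = 0; i < n; i++)
                    read(fds[i], &counts[i], sizeof(counts[i]));
    }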

 10/ Event buffer minimal useful size

    As it stands, the buffer header occupies the first page, even though
    the buffer header struct is 32 bytes long. That's a lot of precious
    RLIMIT_MEMLOCK memory wasted.

    The actual buffer (data) starts at the next page (from builtin-top.c):

    static void mmap_read_counter(struct mmap_data *md)
    {
            unsigned int head = mmap_read_head(md);
            unsigned int old = md->prev;
            unsigned char *data = md->base + page_size;
            /* ... */
    }

    Given that the buffer "full" notifications are sent on page-crossing
    boundaries, if the actual buffer payload size is 1 page, you are
    guaranteed to have your samples overwritten.

    This leads me to believe that the minimal buffer size to get useful
    data is 3 pages. This is per event group per thread. That puts a lot
    of pressure on RLIMIT_MEMLOCK, which is usually set fairly low by
    distros.
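
    For scale: if, as I believe, the mapping must be one header page
    plus a power-of-two number of data pages, the smallest useful
    mapping per event group would be along these lines (a sketch; the
    1+2^n constraint is my understanding of the current code):

    #include <sys/mman.h>
    #include <unistd.h>

    void *map_counter_buffer(int fd)
    {
            long   page_size = sysconf(_SC_PAGESIZE);
            /* 1 header page + 2 data pages: per the reasoning above,
             * the practical minimum to avoid overwritten samples */
            size_t len = (1 + 2) * page_size;

            return mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    }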

 11/ Missing definitions for generic hardware events

    As soon as you define generic events, you need to provide a clear
    and precise definition as to what they measure. This is crucial to
    making them useful. I have not seen such a definition yet.

II/ X86 comments

 1/ Fixed counters on Intel

    You cannot simply fall back to generic counters if you cannot find
    a fixed counter. There are model-specific bugs: for instance,
    UNHALTED_REFERENCE_CYCLES (0x013c) does not measure the same thing
    on Nehalem when it is used in fixed counter 2 as in a generic
    counter. The same is true on Core.

    You cannot simply look at the event field code to determine whether
    the event is supported by a fixed counter. You must look at the
    other fields, such as edge, invert, and cnt-mask. If those are
    present, then you have to fall back to using a generic counter, as
    fixed counters only support priv-level filtering. As indicated
    above, though, programming UNHALTED_REFERENCE_CYCLES on a generic
    counter does not count the same thing, so for this event you need
    to fail if filters other than priv levels are present.
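
    Schematically, the fixed-counter decision has to look beyond the
    event code. A sketch using the usual x86 PERFEVTSEL bit positions
    (treat the exact masks as assumptions):

    #include <stdint.h>

    #define EVTSEL_EDGE   (1ULL   << 18) /* edge detect          */
    #define EVTSEL_INV    (1ULL   << 23) /* invert counter mask  */
    #define EVTSEL_CMASK  (0xffULL << 24) /* counter mask        */

    /* a fixed counter is usable only if no filters beyond priv
     * levels are requested */
    int fits_fixed_counter(uint64_t evtsel)
    {
            return !(evtsel & (EVTSEL_EDGE | EVTSEL_INV | EVTSEL_CMASK));
    }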

 2/ Event knowledge missing

    There are constraints on events in Intel processors. Different
    constraints also exist on AMD64 processors, especially with
    uncore-related events.

    In your model, those need to be taken care of by the kernel. Should
    the kernel make the wrong decision, there would be no workaround for
    user tools. Take the example I outlined just above with Intel fixed
    counters.

    The current code base does not have any constrained-event support,
    therefore bogus counts may be returned depending on the event
    measured.

III/ Requests

 1/ Sampling period randomization

    It is our experience (on Itanium, for instance) that for certain
    sampling measurements, it is beneficial to randomize the sampling
    period a bit. This is in particular the case when sampling on an
    event that happens very frequently and which is not related to
    timing, e.g., branch_instructions_retired. Randomization helps
    mitigate the bias. You do not need anything sophisticated. But when
    you are using a kernel-level sampling buffer, you need to have the
    kernel do the randomization. Randomization needs to be supported
    per event.
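
    Nothing sophisticated is needed; something along these lines,
    applied by the kernel each time it re-arms the period, would do (a
    sketch; the mask would be a per-event parameter):

    #include <stdint.h>
    #include <stdlib.h>

    /* perturb the sampling period within roughly +/- mask/2; random()
     * stands in for whatever cheap in-kernel PRNG is available */
    uint64_t next_period(uint64_t base_period, uint64_t mask)
    {
            int64_t delta = (int64_t)(random() & mask)
                          - (int64_t)(mask >> 1);

            return base_period + delta;
    }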

IV/ Open questions

 1/ Support for model-specific uncore PMU monitoring capabilities

    Recent processors have multiple PMUs: typically one per core, but
    also one at the socket level, e.g., on Intel Nehalem. It is expected
    that this API will provide access to these PMUs as well.

    It seems that with the current API, raw events for those PMUs would
    need a new architecture-specific type, as the event encoding by
    itself may not be enough to disambiguate between a core and an
    uncore PMU event.

    How are those events going to be supported?
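
    For instance, a new attr.type value would presumably be needed,
    since the raw event code alone is ambiguous (the PERF_TYPE_UNCORE_RAW
    name below is entirely hypothetical):

    struct perf_counter_attr attr = {
            /* hypothetical: the same raw code can mean different things
             * on the core and uncore PMUs, so the type must carry the
             * distinction */
            .type   = PERF_TYPE_UNCORE_RAW, /* made-up, not in the API  */
            .config = raw_uncore_code,      /* placeholder for a model- */
                                            /* specific event encoding  */
    };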

 2/ Features impacting all counters

    On some PMU models, e.g., Itanium, there are certain features which
    influence all active counters. For instance, there is a way to
    restrict monitoring to a range of contiguous code or data addresses
    using both some PMU registers and the debug registers.

    Given that the API exposes events (counters) as independent of each
    other, I wonder how range restriction could be implemented.

    Similarly, on Itanium, there are global behaviors. For instance, on
    counter overflow the entire PMU freezes all at once. That seems
    contradictory to the design of the API, which creates the illusion
    of independence.

    What solutions do you propose?

 3/ AMD IBS

    How is AMD IBS going to be implemented?

    IBS has two separate sets of registers: one to capture fetch-related
    data and another one to capture instruction execution data. For
    each, there is one config register but multiple data registers. In
    each mode, there is a specific sampling period and IBS can
    interrupt.

    It looks like you could define two pseudo-events or event types and
    then define a new record_format and read_format. Those formats would
    only be valid for an IBS event.

    Is that how you intend to support IBS?

 4/ Intel PEBS

    Since Netburst-based processors, Intel PMUs have supported a
    hardware sampling buffer mechanism called PEBS.

    PEBS only really became useful with Nehalem.

    Not all events support PEBS. Up until Nehalem, only one counter
    supported PEBS (PMC0). The format of the hardware buffer has changed
    between Core and Nehalem. It is not yet architected, thus it can
    still evolve with future PMU models.

    On Nehalem, there is a new PEBS-based feature called Load Latency
    Filtering which captures where data cache misses occur (similar to
    Itanium D-EAR). Activating this feature requires setting a latency
    threshold hosted in a separate PMU MSR.

    On Nehalem, given that all four generic counters support PEBS, the
    sampling buffer may contain samples generated by any of the four
    counters. The buffer includes a bitmask of registers to determine
    the source of the samples. Multiple bits may be set in the bitmask.

    How will PEBS be supported in this new API?

 5/ Intel Last Branch Record (LBR)

    Intel processors since Netburst have had a cyclic buffer hosted in
    registers which can record taken branches. Each taken branch is
    stored in a pair of LBR registers (source, destination). Up until
    Nehalem, there were no filtering capabilities for LBR. LBR is not
    an architected PMU feature.

    There is no counter associated with LBR. Nehalem has an LBR_SELECT
    MSR. However, there are some constraints on it given that it is
    shared between hardware threads.

    LBR is only useful when sampling and therefore must be combined with
    a counter. LBR must also be configured to freeze on PMU interrupt.

    How is LBR going to be supported?