Re: [patch] perf: ARMv7 wrong "branches" generalized instruction

From: Ingo Molnar
Date: Thu Aug 11 2011 - 04:16:51 EST



* Will Deacon <will.deacon@xxxxxxx> wrote:

> [...] From what I've seen of perf users on ARM, they start with the
> ABI events, get some nonsensical results and then switch
> exclusively to raw events from then on.

Could you give a specific example of such nonsensical output on ARM?
Bugs should be fixed and yes, i can see that if ARM produces
nonsensical output then people won't use that nonsensical output
(duh). Please fix or improve the nonsensical output.

Btw., i have a pretty different experience from you: people will use
most of the (default) generic events pretty happily because most
developers have an adequate notion of 'cycles, branches,
instructions' and they will *STOP* at the boundary of having to go
into CPU microarchitecture specific details ...

People just use the tool defaults in most cases, only a select few
will bother with model specific events. Life is short and learning
CPU microarchitecture specific details is a long and difficult
process that is not justified for most users/developers - not in
small part because the juicy bits of how specific CPUs really work
(and what raw events correspond to those details) are behind an NDA
protected curtain, only accessible to a few privileged people ...

That is not what Linux interfaces are about in my opinion.

So what you and Vince are suggesting, to dumb down the kernel parts
of perf and force users into raw or microarchitecture specific events
actually *reduces* the user-base very significantly - while in
practice even just cycles, instructions and branches level analysis
handles 99% of the everyday performance analysis needs ...

We saw how the "push CPU specific events to users and tooling"
concept didn't work with oprofile - why do we have to re-discuss this
part of failed Linux history again and again?

The approach Vince and you are suggesting is literally sacrificing
99% of utility for 1% of the users - a not very smart approach. I
don't mind accommodating the needs of 1% of power-users (at all), but:

*NOT AT THE EXPENSE OF THE COMMON CASE*.

doh.

> > I agree 100%, but it's an unpopular opinion on linux-kernel.
> > (Note that I'm the one who contributed ARM Cortex A8/A9 support
> > to both libpfm4 and PAPI).
>
> I can see why it's an unpopular idea if it's not necessary on your
> architecture but for ARM it's really the only way forward without
> continuing to introduce a mess of sparsely populated event tables
> every time a new CPU crops up.

Generic events are not about lkml popularity ... they are about
usability.

And why would it have to be implemented in a messy way? We have a
number of CPU specific tables (and quirks) on x86 as well - that's
the job of pretty much any kernel driver, to abstract things away in
a per CPU, often per device (and sometimes even per card variant
type) manner.
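
To make the per-CPU table idea concrete, here is a minimal sketch in C. It
is not the kernel's actual code: the enum, the struct, the CPU names and
every raw event code below are hypothetical placeholders, not values from
any TRM. It only shows the shape of the abstraction: one small table per
CPU model, with an explicit "unsupported" marker where a PMU has no
suitable counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generic event IDs, standing in for the ABI-level events. */
enum generic_event {
    GEN_CYCLES,
    GEN_INSTRUCTIONS,
    GEN_BRANCHES,
    GEN_BRANCH_MISSES,
    GEN_MAX
};

/* Marker for a generic event that a given PMU cannot count. */
#define EVENT_UNSUPPORTED (-1)

/* One table per CPU model: generic event -> model-specific raw code. */
struct cpu_pmu {
    const char *name;
    int raw_code[GEN_MAX];
};

/* Placeholder raw codes, made up for illustration only. */
static const struct cpu_pmu example_cpu_a = {
    .name = "example-cpu-a",
    .raw_code = {
        [GEN_CYCLES]        = 0x01,
        [GEN_INSTRUCTIONS]  = 0x02,
        [GEN_BRANCHES]      = 0x03,
        [GEN_BRANCH_MISSES] = 0x04,
    },
};

/* A second model with different codes and one missing event. */
static const struct cpu_pmu example_cpu_b = {
    .name = "example-cpu-b",
    .raw_code = {
        [GEN_CYCLES]        = 0x10,
        [GEN_INSTRUCTIONS]  = 0x20,
        [GEN_BRANCHES]      = EVENT_UNSUPPORTED,
        [GEN_BRANCH_MISSES] = 0x40,
    },
};

/* Translate a generic event into the raw code for one CPU model. */
int map_generic_event(const struct cpu_pmu *pmu, enum generic_event ev)
{
    assert(pmu != NULL);
    if ((int)ev < 0 || ev >= GEN_MAX)
        return EVENT_UNSUPPORTED;
    return pmu->raw_code[ev];
}
```

A caller asks for GEN_BRANCHES and never sees the raw code; on a model
where the entry is EVENT_UNSUPPORTED the request can be rejected cleanly
instead of returning a misleading count.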

We literally have more than 7 million lines of drivers/* code that
provides generic abstractions - not just a few thousand lines of raw
PCI operations space where user-space can write magic values to ...

Similarly, for perf events we don't do a raw binary ABI mess for
really good reasons: tools and users do not think in CPU and model
specific hex numbers, they operate in higher level concepts.
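
As a rough illustration of that difference from the user-space side, the
sketch below fills in a struct perf_event_attr once for the generic
"branches" event and once for a raw event. The helper names are
hypothetical; the constants and the perf_event_open syscall number come
from <linux/perf_event.h> and <sys/syscall.h>. The raw code is whatever the
caller looked up for one specific CPU, so none is hard-coded here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set up a disabled, user-space-only counter. */
void init_counter_attr(struct perf_event_attr *attr, __u32 type, __u64 config)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = type;
    attr->config = config;
    attr->disabled = 1;        /* start stopped, enable explicitly later */
    attr->exclude_kernel = 1;  /* count user-space only */
    attr->exclude_hv = 1;
}

/* Generic event: the same request works on any CPU the kernel supports. */
void init_generic_branches(struct perf_event_attr *attr)
{
    init_counter_attr(attr, PERF_TYPE_HARDWARE,
                      PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
}

/* Raw event: raw_code is only meaningful for one specific CPU model. */
void init_raw_event(struct perf_event_attr *attr, __u64 raw_code)
{
    init_counter_attr(attr, PERF_TYPE_RAW, raw_code);
}

/*
 * Thin wrapper around the syscall (glibc has no perf_event_open function).
 * Counts the calling thread on any CPU. Returns an fd, or -1 with errno
 * set, for example when the event is unsupported or access is restricted.
 */
long open_counter_self(struct perf_event_attr *attr)
{
    return syscall(__NR_perf_event_open, attr, 0, -1, -1, 0UL);
}
```

With the fd from open_counter_self(), a tool would typically enable the
counter with ioctl(fd, PERF_EVENT_IOC_ENABLE, 0), run the workload, and
read() the 64-bit count. Only init_raw_event() forces the caller to know a
model specific number.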

That is a basic quality of implementation property.

It's the *job* of the kernel to abstract things away, we don't shy
away from that ...

> > Since the generalized events are there and ABI though, people are
> > going to use them. That's why I've been writing tests that check
> > them to see exactly what they are measuring.
>
> Right, but as I say, `instructions' on one core might not be
> `instructions' on another core. Just removing the ABI types from
> ARM will at least stop people using them. [...]

What are you talking about? Sure, an ARM Cortex-A9 will execute
instructions of a user-space application just as much as other ARM
CPUs do. Sure, as it executes that app it will execute instructions;
you can single-step through it and thus you can count how many
instructions it has executed, right?

> > It's still an important issue to know what "branches" measures,
> > just it probably shouldn't be a kernel issue like it's become.
>
> The TRM for the A9 will describe various events for counting
> branch-related events. These may be specific to the pipeline and
> micro-architecture and therefore you can't really tar them all with
> the same brush.

The best generic event is the one that matches what the coder/user of
a user-space app sees the CPU executing: instructions, branches, etc.

If the PMU cannot give that then the (statistically) next best
approximation should be provided.

If you think about it that is a pretty unambiguous definition: each
ARM core will execute user-space applications and the same
(compatible) assembly routine results in the same end result, in the
same number of visible assembly instructions, right?

In practice most people will use the default event: cycles for perf
stat/top and the default 'perf stat' output.

We've also had numerous cases where kernel developers went way beyond
those metrics and appreciated that tooling would provide good
approximations for all those events regardless of what CPU type the
workload was running on (and sometimes even documented this in the
changelog).

So having generic events is not some fancy, unused property, but a
pretty important measurement aspect of perf.

Thanks,

Ingo