Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updatedpatch

From: Maynard Johnson
Date: Thu Feb 08 2007 - 19:00:28 EST


Milton,
Thank you for your comments. Carl will reply to certain parts of your posting where he's more knowledgeable than I. See my replies below.

-Maynard

Milton Miller wrote:

On Feb 6, 2007, at 5:02 PM, Carl Love wrote:



This is the first update to the patch previously posted by Maynard
Johnson as "PATCH 4/4. Add support to OProfile for profiling CELL".






[snip]


Data collected


The current patch starts tackling these translation issues for the
presently common case of a static self contained binary from a single
file, either single separate source file or embedded in the data of
the host application. When creating the trace entry for a SPU
context switch, it records the application owner, pid, tid, and
dcookie of the main executable. It addition, it looks up the
object-id as a virtual address and records the offset if it is non-zero,
or the dcookie of the object if it is zero. The code then creates
a data structure by reading the elf headers from the user process
(at the address given by the object-id) and building a list of
SPU address to elf object offsets, as specified by the ELF loader
headers. In addition to the elf loader section, it processes the
overlay headers and records the address, size, and magic number of
the overlay.

When the hardware trace entries are processed, each address is
looked up this structure and translated to the elf offset. If
it is an overlay region, the overlay identify word is read and
the list is searched for the matching overlay. The resulting
offset is sent to the oprofile system.

The current patch specifically identifies that only single
elf objects are handled. There is no code to handle dynamic
linked libraries or overlays. Nor is there any method to


Yes, we do handle overlays. (Note: I'm looking into a bug right now in our overlay support.)

present samples that may have been collected during context
switch processing, they must be discarded.


My proposal is to change what is presented to user space. Instead
of trying to translate the SPU address to the backing file
as the samples are recorded, store the samples as the SPU
context and address. The context switch would record tid,
pid, object id as it does now. In addition, if this is a
new object-id, the kernel would read elf headers as it does
today. However, it would then proceed to provide accurate
dcookie information for each loader region and overlay. To
identify which overlays are active, (instead of the present
read on use and search the list to translate approach) the
kernel would record the location of the overlay identifiers
as it parsed the kernel, but would then read the identification
word and would record the present value as an sample from
a separate but related stream. The kernel could maintain
the last value for each overlay and only send profile events
for the deltas.


Discussions on this topic in the past have resulted in the current implementation precisely because we're able to record the samples as fileoffsets, just as the userspace tools expect. I haven't had time to check out how much this would impact the userspace tools, but my gut feel is that it would be quite significant. If we were developing this module with a matching newly-created userspace tool, I would be more inclined to agree that this makes sense. But you give no rationale for your proposal that justifies the change. The current implementation works, it has no impact on normal, non-profiling behavior, and the overhead during profiling is not noticeable.

This approach trades translation lookup overhead for each
recorded sample for a burst of data on new context activation.
In addition it exposes the sample point of the overlay identifier
vs the address collection. This allows the ambiguity to be
exposed to user space. In addition, with the above proposed
kernel timer vs sample collection, user space could limit the
elapsed time between the address collection and the overlay
id check.


Yes, there is a window here where an overlay could occur before we finish processing a group of samples that were actually taken from a different overlay. The obvious way to prevent that is for the kernel (or SPUFS) to be notified of the overlay and let OProfile know that we need to drain (perhaps discard would be best) our sample trace buffer. As you indicate above, your proposal faces the same issue, but would just decrease the number of bogus samples. I contend that the relative number of bogus samples will be quite low in either case. Ideally, we should have a mechanism to eliminate them completely so as to avoid confusion the user's part when they're looking at a report. Even a few bogus samples in the wrong place can be troubling. Such a mechanism will be a good future enhancement.

[snip]

milton
--
miltonm@xxxxxxx Milton Miller
Speaking for myself only.

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@xxxxxxxxxx
https://ozlabs.org/mailman/listinfo/linuxppc-dev




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/