Re: [PATCH v3 2/2] perf/sdt: Directly record SDT events with 'perf record'

From: Ravi Bangoria
Date: Tue Feb 28 2017 - 05:58:02 EST


Thanks Masami for the review.
Please find my replies to your comments.

On Tuesday 28 February 2017 11:15 AM, Masami Hiramatsu wrote:
> On Fri, 24 Feb 2017 13:13:25 +0530
> Ravi Bangoria <ravi.bangoria@xxxxxxxxxxxxxxxxxx> wrote:
>
>> After invoking 'perf record', behind the scenes, it checks whether
>> the event specified is an SDT event by using the flag '%' or string
>> 'sdt_'. After that, it first checks whether event already exists
>> with that *name* in uprobe_events file. If found, it records that
>> particular event. Otherwise, it does a lookup of the probe cache to
>> find out the SDT event. If its not present, it throws an error.
>> If found, it again tries to find existing events from uprobe_events
>> file, but this time it uses *address* and *filename* for comparison.
>> Finally it writes new events into the uprobe_events file and starts
>> recording. It also maintains a list of the event names that were
>> written to uprobe_events file for this session. After finishing the
>> record session, it removes the events from the uprobe_events file
>> using the maintained name list.
> OK, the behavior looks good. However, the implementation seems
> too complex, could you make it simpler?
> I have some comments on the code, see below.

Yes, it's complex code. It's all because of multiple SDT events with same name
exists and we wants to allow both -- 'perf probe' as well as 'perf record' -- which
creates complex situations that has to be taken care. See my replies below for
such examples.

>> +/*
>> + * Remove ith tev from pev->tevs list and shift remaining
>> + * tevs(i+1 to pev->ntevs) one step.
>> + */
>> +static void shift_pev(struct perf_probe_event *pev, int i)
>> +{
>> + int j;
>> +
>> + free(pev->tevs[i].event);
>> + free(pev->tevs[i].group);
>> + free(pev->tevs[i].args);
>> + free(pev->tevs[i].point.realname);
>> + free(pev->tevs[i].point.symbol);
>> + free(pev->tevs[i].point.module);
>> +
>> + /*
>> + * If ith element is last element, no need to shift,
>> + * just decrement pev->ntevs.
>> + */
>> + if (i == pev->ntevs - 1)
>> + goto ret;
>> +
>> + for (j = i; j < pev->ntevs - 1; j++) {
>> + pev->tevs[j].event = pev->tevs[j + 1].event;
>> + pev->tevs[j].group = pev->tevs[j + 1].group;
>> + pev->tevs[j].nargs = pev->tevs[j + 1].nargs;
>> + pev->tevs[j].uprobes = pev->tevs[j + 1].uprobes;
>> + pev->tevs[j].args = pev->tevs[j + 1].args;
>> + pev->tevs[j].point.realname = pev->tevs[j + 1].point.realname;
>> + pev->tevs[j].point.symbol = pev->tevs[j + 1].point.symbol;
>> + pev->tevs[j].point.module = pev->tevs[j + 1].point.module;
>> + pev->tevs[j].point.offset = pev->tevs[j + 1].point.offset;
>> + pev->tevs[j].point.address = pev->tevs[j + 1].point.address;
>> + pev->tevs[j].point.retprobe = pev->tevs[j + 1].point.retprobe;
>> + }
>> +
>> +ret:
>> + pev->ntevs--;
>> +}
> This code smells no good...
> Why don't you make a list of newly added probes ?
>

Yes. This is little ugly. Consider a following scenario.

$ readelf -n /usr/lib64/libpthread-2.24.so | grep -A2 Provider
Provider: libpthread
Name: mutex_entry
Location: 0x0000000000009ddb, Base: 0x00000000000139cc, ...
--
Provider: libpthread
Name: mutex_entry
Location: 0x000000000000bcbb, Base: 0x00000000000139cc, ...

$ perf probe sdt_libpthread:mutex_entry

$ cat /sys/kernel/debug/tracing/uprobe_events
p:sdt_libpthread/mutex_entry /usr/lib64/libpthread-2.24.so:0x0000000000009ddb
p:sdt_libpthread/mutex_entry_1 /usr/lib64/libpthread-2.24.so:0x000000000000bcbb

$ perf probe -d sdt_libpthread:mutex_entry

$ perf record -a -e sdt_libpthread:mutex_entry
/* This should record:
sdt_libpthread/mutex_entry <== New entry (delete at the end of session)
sdt_libpthread/mutex_entry_1 <== Reuse (do not delete at the end of session)
*/

First of all, perf looks into uprobe_events for entry with sdt_libpthread:mutex_entry
*name*. As it's not present, perf does a lookup in probe-cache and gets two entries.
Perf stores them in 'pev->tevs' array. Now it again compares uprobe_events entries
with entries found from probe-cache with *filename* and *address* as comparison
parameter (because this combo makes unique identifier for any SDT event in a
particular system - please correct me if that's wrong). And perf finds one entry
sdt_libpthread/mutex_entry_1.

Keeping this entry in 'pevs->tevs' and calling apply_perf_probe_events() will add
new entry for that event even if it exists. So this code removes that particular
element of an array and shift all remaining elements one step.

Please suggest if you have a better approach.

>> +
>> +/* Compare address and filename */
>> +static bool is_sdt_match(struct probe_trace_event *tev1,
>> + struct probe_trace_event *tev2)
>> +{
>> + return (tev1->point.address == tev2->point.address &&
>> + !(strcmp(tev1->point.module, tev2->point.module)));
>> +}
>> +
...
>> + probe_conf.max_probes = MAX_PROBES;
>> + probe_conf.force_add = 1;
>> +
>> + /*
>> + * Find the sdt event from the cache. We deliberately check failure
>> + * of this function after checking entries in uprobe_events. Because,
>> + * we may find matching entry from uprobe_events and in that case we
>> + * should continue recording that event.
>> + */
>> + pev->ntevs = find_sdt_events_from_cache(pev, &pev->tevs);
>> +
>> + /* Prepare list of existing sdt events from uprobe_events */
>> + INIT_LIST_HEAD(&esl.list);
>> + ret = probe_file__get_exst_sdt_list(&esl);
>> + if (ret < 0)
>> + goto free_str;
>> +
>> + /* If there is entry with the same name in uprobe_events, record it. */
>> + found = probe_file__add_exst_sdt_event(&esl, sdt_events, pev);
>> + if (found) {
>> + ret = (found > 0) ? 0 : found;
>> + goto free_str;
>> + }
> Curious, this should be done before searching cached SDT, so that
> we can just skip listing up existing probes.

Yes that's intentional. We warn user if event with exact name found from
uprobe_events and there are multiple instances of that event found from
probe-cache. For example,

$ readelf -n /usr/lib64/libpthread-2.24.so | grep -A2 Provider
Provider: libpthread
Name: mutex_entry
Location: 0x0000000000009ddb, Base: 0x00000000000139cc, ...
--
Provider: libpthread
Name: mutex_entry
Location: 0x000000000000bcbb, Base: 0x00000000000139cc, ...

$ cat /sys/kernel/debug/tracing/uprobe_events
p:sdt_libpthread/mutex_entry /usr/lib64/libpthread-2.24.so:0x0000000000009ddb

$ perf record -a -e sdt_libpthread/mutex_entry
Warning: Found 2 events from probe-cache with name 'sdt_libpthread:mutex_entry'.
Since probe point already exists with this name, recording only 1 event.
Hint: Please use 'perf probe -d sdt_libpthread:mutex_entry*' to allow record on all events.

>> +
>> + /* Reaching here means no matching entry found in uprobe_events. */
>> + filter = filter_exst_sdt_events_from_pev(pev, &esl);
>> + if (!filter && pev->ntevs == 0) {
>> + pr_err("%s:%s not found in the cache\n", pev->group,
>> + pev->event);
>> + ret = -EINVAL;
>> + goto free_str;
>> + } else if (pev->ntevs < 0) {
>> + err->str = strdup("Cache lookup failed.\n");
>> + ret = pev->ntevs;
>> + goto free_str;
>> + }
> Again, why is the esl needed here? We can remove events by using it's name.
>
> What I expected was,
>
> 0) add "generated_sdts" in record structure.
> 1) search existing events by using its name from tracing/events/.
> (Don't need to touch uprobe_events)

I didn't get your point to look at tracing/events/. It's easy to look into
uprobe_events, because perf has a builtin functions that easily transform
entries of uprobe_events to 'tevs'. And uprobe_events has all data like
eventname, filename, address which are crucial for this implementation.

> 2) if it does not exist, add new probe points and store its name in "generated_sdts".

This step is little difficult. As I mentioned earlier, major issue is, event
exists in uprobe_events but with different name. In such case I've to
reuse existing entries and also have to add new entries for events
which are absent.

> 3) record events
> 4) remove the events by using "generated_sdts" which only has the event name.
>

Let me know if I've misunderstood any of your point.

Thanks,
-Ravi