Re: [RFC PATCH v1 1/1] tracing/kprobe: Add multi-probe support for 'perf_kprobe' PMU

From: Francis Laniel
Date: Mon Aug 21 2023 - 08:55:22 EST


Hi.

Le samedi 19 août 2023, 03:11:05 CEST Masami Hiramatsu a écrit :
> Hi Francis,
> (Cc: Song Liu and BPF ML)
>
> On Fri, 18 Aug 2023 20:12:11 +0200
>
> Francis Laniel <flaniel@xxxxxxxxxxxxxxxxxxx> wrote:
> > Hi.
> >
> > Le vendredi 18 août 2023, 15:05:37 CEST Masami Hiramatsu a écrit :
> > > On Thu, 17 Aug 2023 13:06:20 +0200
> > >
> > > Francis Laniel <flaniel@xxxxxxxxxxxxxxxxxxx> wrote:
> > > > Hi.
> > > >
> > > > Le jeudi 17 août 2023, 09:50:57 CEST Masami Hiramatsu a écrit :
> > > > > Hi,
> > > > >
> > > > > On Wed, 16 Aug 2023 18:35:17 +0200
> > > > >
> > > > > Francis Laniel <flaniel@xxxxxxxxxxxxxxxxxxx> wrote:
> > > > > > When using sysfs, it is possible to create kprobe for several
> > > > > > kernel
> > > > > > functions sharing the same name, but of course with different
> > > > > > addresses,
> > > > > > by writing their addresses in kprobe_events file.
> > > > > >
> > > > > > When using PMU, if only the symbol name is given, the event will
> > > > > > be
> > > > > > created for the first address which matches the symbol, as
> > > > > > returned by
> > > > > > kallsyms_lookup_name().
> > > > >
> > > > > Do you mean probing the same name symbols? Yes, it is intended
> > > > > behavior,
> > > > > since it is not always true that the same name function has the same
> > > > > prototype (it is mostly true but is not ensured), it is better to
> > > > > leave
> > > > > user to decide which one is what you want to probe.
> > > >
> > > > This is what I meant.
> > > > I also share your mind regarding leaving the users deciding which one
> > > > they
> > > > want to probe but in my case (which I agree is a bit a corner one) it
> > > > leaded me to misunderstanding as the PMU kprobe was only added to the
> > > > first ntfs_file_write_iter() which is not the one for ntfs3.
> > >
> > > Hmm, OK. I think in that case (multiple same-name symbols exist) the
> > > default behavior is rejecting with error message. And optionally, it
> > > will probe all or them like your patch.
> >
> > I am not sure to understand.
> > Can you please precise the default behavior of which software component?
>
> I meant that the behavior of the kprobe-events via /sys/kernel/tracing.
> But your patch is for the other interface for perf as kprobe-event PMU.
> In that case, I think we should CC to other users like BPF because
> this may change the expected behavior.
>
> > > > > Have you used 'perf probe' tool? It tries to find the appropriate
> > > > > function
> > > > > by line number and creates the probe by 'text+OFFSET' style, not by
> > > > > symbol.
> > > > > I think this is the correct way to do that, because user will not
> > > > > know
> > > > > which 'address' of the symbol is what the user want.
> > > >
> > > > 'perf probe' perfectly does the trick, as it would find all the kernel
> > > > addresses which correspond to the symbol name and create as many
> > > > probes as
> > > > corresponding symbols [1]:
> > > > root@vm-amd64:~# perf probe --add ntfs_file_write_iter
> > >
> > > If you can specify the (last part of) file path as below,
> > >
> > > perf probe --add ntfs_file_write_iter@ntfs3/file.c
> > >
> > > Then it will choose correct one. :)
> >
> > Nice! TIL thank you! perf is really powerful!
>
> Yeah, but note that the perf-probe is a tool to setup a 'visible' tracepoint
> event. After making a new tracepoint event, the perf tool can use such
> "[Tracepoint event]" instead of PMU.
>
> Unfortunately, kprobe-event 'PMU' version doesn't support this
> because it has been introduced for BPF. See the original series;
>
> https://lore.kernel.org/lkml/20171206224518.3598254-1-songliubraving@xxxxxx/
>
> So, the "local_kprobe_event" is making a kprobe PMU which is a event for
> local session, that is designed for using such event from BPF (if I
> understand correctly). Of course BPF tool can setup its local
> event with a unique symbol + offset (not just a symbol) in a BPF tool with
> perf-probe but it doesn't.
>
> Could you tell me how do you use this feature, for what perpose?

Sure (I think I detailed this in the cover letter but I only sent it to the
"main" mailing list and not the tracing one, sorry for this inconvenience)!

Basically, I was adding NTFS tracing to an existing tool which monitors slow
I/Os using BPF [1].
To test the tool, I compiled a kernel with both NTFS module built-in and
figured out the write operations when done on ntfs3 were missing from the
output of the tool.
The problem comes from the library I use in the tool which does not handle
well when it exists different symbols with the same name.
Contrary to perf, which only handles kprobes through sysfs, the library
handles it in both way (sysfs and PMU) with a preference for PMU when
available [2].

After some discussion, I thought there could be a way to handle this
automatically in the kernel when using PMU kprobes, hence this patch.
I totally understand the case I described above is really a corner one, but I
thought this feature could be useful for other people.
In the case of the library itself, we could indeed find the address in /proc/
kallsyms but it would mean having CAP_SYS_ADMIN which is not forcefully
something we want to enforce.
Also, if we need to read /boot/vmlinuz or /boot/System.map we also need to be
root as these files often belong to root and cannot be read by others.
So, this patch solves the above problem while not needing specific capabilities
as the kernel will solve it for us.

> If you just need to trace/profile a specific function which has the same
> name symbols, you might be better to use `perf probe` +
> `/sys/kernel/tracing` or `perf record -e EVENT`.
>
> Or if you need to run it with CAP_PERFMON, without CAP_SYS_ADMIN,
> we need to change a userspace tool to find the correct address and
> pass it to the perf_event_open().
>
> > > > Added new events:
> > > > probe:ntfs_file_write_iter (on ntfs_file_write_iter)
> > > > probe:ntfs_file_write_iter (on ntfs_file_write_iter)
> > > >
> > > > You can now use it in all perf tools, such as:
> > > > perf record -e probe:ntfs_file_write_iter -aR sleep 1
> > > >
> > > > root@vm-amd64:~# cat /sys/kernel/tracing/kprobe_events
> > > > p:probe/ntfs_file_write_iter _text+5088544
> > > > p:probe/ntfs_file_write_iter _text+5278560
> > > >
> > > > > Thought?
> > > >
> > > > This contribution is basically here to sort of mimic what perf does
> > > > but
> > > > with PMU kprobes, as this is not possible to write in a sysfs file
> > > > with
> > > > this type of probe.
> > >
> > > OK, I see it is for BPF only. Maybe BPF program can filter correct one
> > > to access the argument etc.
> >
> > I am not sure I understand, can you please precise?
> > The eBPF program will be run when the kprobe will be triggered, so if the
> > kprobe is armed for the function (e.g. old ntfs_file_write_iter()), the
> > eBPF program will never be called.
>
> As I said above, it is userspace BPF loader issue, because it has to specify
> the correct address via unique symbol + offset, instead of attaching all of
> them. I think that will be more side-effects.
>
> But anyway, thanks for pointing this issue. I should fix kprobe event to
> reject the symbols which is not unique. That should be pointed by other
> unique symbols.

You are welcome and I thank you for the discussion.
Can you please precise more what you think about "reject the symbols which is
not unique"?
Basically something like this:
struct trace_event_call *
create_local_trace_kprobe(char *func, void *addr, unsigned long offs,
bool is_return)
{
...
if (!addr && func) {
array.addrs = NULL;
array.size = 0;
ret = kallsyms_on_each_match_symbol(add_addr, func, &array);
if (ret)
goto error_free;

if (array.size != 1) {
/*
* Function name corresponding to several symbols must
* be passed by address only.
*/
return -EINVAL;
}
}
...
}
?
If my understanding is correct, I think I can write a patch to achieve this.

> Thank you,
>
> > > Thank you,
> > >
> > > > > Thank you,
> > > > >
> > > > > > The idea here is to search all kernel functions which match this
> > > > > > symbol
> > > > > > and
> > > > > > create a trace_kprobe for each of them.
> > > > > > All these trace_kprobes are linked together by sharing the same
> > > > > > trace_probe.
> > > > > >
> > > > > > Signed-off-by: Francis Laniel <flaniel@xxxxxxxxxxxxxxxxxxx>
> > > > > > ---
> > > > > >
> > > > > > kernel/trace/trace_kprobe.c | 86
> > > > > > +++++++++++++++++++++++++++++++++++++
> > > > > > 1 file changed, 86 insertions(+)
> > > > > >
> > > > > > diff --git a/kernel/trace/trace_kprobe.c
> > > > > > b/kernel/trace/trace_kprobe.c
> > > > > > index 1b3fa7b854aa..08580f1466c7 100644
> > > > > > --- a/kernel/trace/trace_kprobe.c
> > > > > > +++ b/kernel/trace/trace_kprobe.c
> > > > > > @@ -1682,13 +1682,42 @@ static int unregister_kprobe_event(struct
> > > > > > trace_kprobe *tk)>
> > > > > >
> > > > > > }
> > > > > >
> > > > > > #ifdef CONFIG_PERF_EVENTS
> > > > > >
> > > > > > +
> > > > > > +struct address_array {
> > > > > > + unsigned long *addrs;
> > > > > > + size_t size;
> > > > > > +};
> > > > > > +
> > > > > > +static int add_addr(void *data, unsigned long addr)
> > > > > > +{
> > > > > > + struct address_array *array = data;
> > > > > > + unsigned long *p;
> > > > > > +
> > > > > > + array->size++;
> > > > > > + p = krealloc(array->addrs,
> > > > > > + sizeof(*array->addrs) * array->size,
> > > > > > + GFP_KERNEL);
> > > > > > + if (!p) {
> > > > > > + kfree(array->addrs);
> > > > > > + return -ENOMEM;
> > > > > > + }
> > > > > > +
> > > > > > + array->addrs = p;
> > > > > > + array->addrs[array->size - 1] = addr;
> > > > > > +
> > > > > > + return 0;
> > > > > > +}
> > > > > > +
> > > > > >
> > > > > > /* create a trace_kprobe, but don't add it to global lists */
> > > > > > struct trace_event_call *
> > > > > > create_local_trace_kprobe(char *func, void *addr, unsigned long
> > > > > > offs,
> > > > > >
> > > > > > bool is_return)
> > > > > >
> > > > > > {
> > > > > >
> > > > > > enum probe_print_type ptype;
> > > > > >
> > > > > > + struct address_array array;
> > > > > >
> > > > > > struct trace_kprobe *tk;
> > > > > >
> > > > > > + unsigned long func_addr;
> > > > > > + unsigned int i;
> > > > > >
> > > > > > int ret;
> > > > > > char *event;
> > > > > >
> > > > > > @@ -1722,7 +1751,64 @@ create_local_trace_kprobe(char *func, void
> > > > > > *addr,
> > > > > > unsigned long offs,>
> > > > > >
> > > > > > if (ret < 0)
> > > > > >
> > > > > > goto error;
> > > > > >
> > > > > > + array.addrs = NULL;
> > > > > > + array.size = 0;
> > > > > > + ret = kallsyms_on_each_match_symbol(add_addr, func, &array);
> > > > > > + if (ret)
> > > > > > + goto error_free;
> > > > > > +
> > > > > > + if (array.size == 1)
> > > > > > + goto end;
> > > > > > +
> > > > > > + /*
> > > > > > + * Below loop allocates a trace_kprobe for each function
with
> > > > > > the
> > > > > > same
> > > > > > + * name in kernel source code.
> > > > > > + * All this differente trace_kprobes will be linked together
> > > > > > through
> > > > > > + * append_trace_kprobe().
> > > > > > + * NOTE append_trace_kprobe() is called in
> > > > > > register_trace_kprobe()
> > > >
> > > > which
> > > >
> > > > > > + * is called when a kprobe is added through sysfs.
> > > > > > + */
> > > > > > + func_addr = kallsyms_lookup_name(func);
> > > > > > + for (i = 0; i < array.size; i++) {
> > > > > > + struct trace_kprobe *tk_same_name;
> > > > > > + unsigned long address;
> > > > > > +
> > > > > > + address = array.addrs[i];
> > > > > > + /* Skip the function address as we already registered
it. */
> > > > > > + if (address == func_addr)
> > > > > > + continue;
> > > > > > +
> > > > > > + /*
> > > > > > + * alloc_trace_kprobe() first considers symbol name,
so we set
> > > > > > + * this to NULL to allocate this kprobe on the given
address.
> > > > > > + */
> > > > > > + tk_same_name =
alloc_trace_kprobe(KPROBE_EVENT_SYSTEM, event,
> > > > > > + (void *)address, NULL,
offs,
> > > > > > + 0 /* maxactive */,
> > > > > > + 0 /* nargs */,
is_return);
> > > > > > +
> > > > > > + if (IS_ERR(tk_same_name)) {
> > > > > > + ret = -ENOMEM;
> > > > > > + goto error_free;
> > > > > > + }
> > > > > > +
> > > > > > + init_trace_event_call(tk_same_name);
> > > > > > +
> > > > > > + if (traceprobe_set_print_fmt(&tk_same_name->tp,
ptype) < 0) {
> > > > > > + ret = -ENOMEM;
> > > > > > + goto error_free;
> > > > > > + }
> > > > > > +
> > > > > > + ret = append_trace_kprobe(tk_same_name, tk);
> > > > > > + if (ret)
> > > > > > + goto error_free;
> > > > > > + }
> > > > > > +
> > > > > > +end:
> > > > > > + kfree(array.addrs);
> > > > > >
> > > > > > return trace_probe_event_call(&tk->tp);
> > > > > >
> > > > > > +error_free:
> > > > > > + kfree(array.addrs);
> > > > > >
> > > > > > error:
> > > > > > free_trace_kprobe(tk);
> > > > > > return ERR_PTR(ret);
> > > >
> > > > ---
> > > > [1]: https://github.com/torvalds/linux/blob/
> > > > 57012c57536f8814dec92e74197ee96c3498d24e/tools/perf/util/probe-event.c
> > > > #L29
> > > > 89- L2993

Best regards.
---
[1]: https://github.com/inspektor-gadget/inspektor-gadget/pull/1879
[2]: https://github.com/cilium/ebpf/blob/
270c859894bd38cdd0c7783317b16343409e4df8/link/kprobe.go#L165-L191