Re: [RFC PATCH tip 0/5] tracing filters with BPF

From: Masami Hiramatsu
Date: Tue Dec 10 2013 - 22:36:02 EST

Next message: David Ahern: "Re: [PATCH V3] perf tools: Change the default filenames for perfkvm diff to perf.data.xxx and perf.data.xxx.old"
Previous message: Dongsheng Yang: "[PATCH V3] perf tools: Change the default filenames for perf kvm diff to perf.data.xxx and perf.data.xxx.old"
In reply to: Alexei Starovoitov: "Re: [RFC PATCH tip 0/5] tracing filters with BPF"
Next in thread: Alexei Starovoitov: "Re: [RFC PATCH tip 0/5] tracing filters with BPF"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

(2013/12/11 11:32), Alexei Starovoitov wrote:
> On Tue, Dec 10, 2013 at 7:47 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>>
>> * Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:
>>
>>>> I'm fine if it becomes a requirement to have a vmlinux built with
>>>> DEBUG_INFO to use BPF and have a tool like perf to translate the
>>>> filters. But that must not replace what the current filters do
>>>> now. That is, it can be an add on, but not a replacement.
>>>
>>> Of course. tracing filters via bpf is an additional tool for kernel
>>> debugging. bpf by itself has use cases beyond tracing.
>>
>> Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for
>> most people.
>
> there is a misunderstanding here.
> I was saying 'of course' to 'not replace current filter infra'.
>
> bpf does not depend on debug info.
> That's the key difference between 'perf probe' approach and bpf filters.
>
> Masami is right that what I was trying to achieve with bpf filters
> is similar to 'perf probe': insert a dynamic probe anywhere
> in the kernel, walk pointers, data structures, print interesting stuff.
>
> 'perf probe' does it via scanning vmlinux with debug info.
> bpf filters don't need it.
> tools/bpf/trace/*_orig.c examples only depend on linux headers
> in /lib/modules/../build/include/
> Today bpf compiler struct layout is the same as x86_64.
>
> Tomorrow bpf compiler will have flags to adjust endianness, pointer size, etc
> of the front-end. Similar to -m32/-m64 and -m*-endian flags.
> Neat part is that I don't need to do any work, just enable it properly in
> the bpf backend. From gcc/llvm point of view, bpf is yet another 'hw'
> architecture that compiler is emitting code for.
> So when C code of filter_ex1_orig.c does 'skb->dev', compiler determines
> field offset by looking at /lib/modules/.../include/skbuff.h
> whereas for 'perf probe' 'skb->dev' means walk debug info.

Right, the offsets of data structure fields can be obtained from the headers.
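
To illustrate the point about header-derived offsets, here is a minimal sketch in plain C. It is not taken from the patch set: struct demo_skb, struct demo_dev, load_ptr_at() and demo_skb_ifindex() are hypothetical stand-ins, and the layout shown is not the real struct sk_buff. It only shows that a field offset is a compile-time constant once the struct definition is visible, so no debug info is needed for the 'skb->dev' step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for kernel types; NOT the real sk_buff layout. */
struct demo_dev {
    int ifindex;
};

struct demo_skb {
    struct demo_skb *next;
    struct demo_dev *dev;
    unsigned int len;
};

/*
 * Load a pointer-sized value found 'off' bytes past 'base'. Generated
 * filter code would do the equivalent with a constant offset.
 */
static const void *load_ptr_at(const void *base, size_t off)
{
    const void *p;

    memcpy(&p, (const char *)base + off, sizeof(p));
    return p;
}

/* Walk skb->dev->ifindex using an offset computed from the struct definition. */
static int demo_skb_ifindex(const struct demo_skb *skb)
{
    const struct demo_dev *dev;

    assert(skb != NULL);
    /* offsetof() is resolved by the compiler from the header alone. */
    dev = load_ptr_at(skb, offsetof(struct demo_skb, dev));
    return dev ? dev->ifindex : -1;
}
```

Calling demo_skb_ifindex() on a demo_skb whose dev pointer is set returns that device's ifindex; it returns -1 when dev is NULL. The sketch covers only the offset lookup and says nothing about where the skb pointer itself comes from.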

However, how would bpf find the register or stack slot that holds
skb itself? In a tracepoint macro it can get it from the function
parameters (with a trick like the one jprobe uses), but I doubt you
can do that on kprobes/uprobes without any debuginfo support. :(

And is it possible to trace a field of a data structure that is
defined locally in somewhere.c? :) (maybe it's just a corner case)

> Something like: cc1 -mlayout_x86_64 filter.c will produce bpf code that
> walks all data structures in the same way x86_64 does it.
> Even if the user makes a mistake and uses -mlayout_aarch64, it won't crash.
> Note that all -m* flags will be in one compiler. It won't grow any bigger
> because of that. All of it is already supported by C front-ends.
> It may sound complex, but it is really very little code for the bpf backend.
>
> I didn't look inside systemtap/ktap enough to say how much they're
> relying on presence of debug info to make a comparison.
>
> I see two main use cases for bpf tracing filters: debugging live kernel
> and collecting stats. Same tricks that [sk]tap do with their maps.
> Or maybe some of the stats that 'perf record' collects in userspace
> can be collected by bpf filter in kernel and stored into generic bpf table?
>
>> Would it be possible to make BPF filters recognize exposed details
>> like the current filters do, without depending on the vmlinux?
>
> Well, if you say that presence of linux headers is also too much to ask,
> I can hook bpf after probes stored all the args.
>
> This way current simple filter syntax can move to userspace.
> 'arg1==x || arg2!=y' can be parsed by userspace, bpf code
> generated and fed into kernel. It will be faster than walk_pred_tree(),
> but if we cannot remove 2k lines from trace_events_filter.c
> because of backward compatibility, extra performance becomes
> the only reason to have two different implementations.
>
> Another use case is to optimize fetch sequences of dynamic probes
> as Masami suggested, but backward compatibility requirement
> would preserve two ways of doing it as well.

The backward compatibility requirement applies only to the interface,
not to the implementation, I think. :) The fetch methods and filter
preds already parse the argument into a syntax tree. IMHO, bpf
can optimize that tree into a simple opcode stream.
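
The tree-to-opcode idea can be sketched roughly as follows. This is a toy model under stated assumptions, not the code in trace_events_filter.c and not the BPF instruction set: struct pred_node, struct insn, flatten() and run() are all hypothetical names, and the "opcode stream" is a simple post-order stack program. A filter such as 'arg1==x || arg2!=y' becomes three flat instructions that are evaluated in a single loop instead of a tree walk.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STACK 16

/* One enum serves as both tree node kind and opcode (hypothetical). */
enum op { OP_EQ, OP_NE, OP_AND, OP_OR };

/* Parsed filter predicate: leaves compare args[arg] with val. */
struct pred_node {
    enum op op;
    int arg;                        /* argument index, leaves only */
    long val;                       /* constant, leaves only */
    const struct pred_node *left;   /* inner nodes only */
    const struct pred_node *right;  /* inner nodes only */
};

/* One instruction of the flat stream. */
struct insn {
    enum op op;
    int arg;
    long val;
};

/* Emit the tree in post-order. Returns insns written, or -1 if out of room. */
static int flatten(const struct pred_node *n, struct insn *out, int max)
{
    int used = 0, r;

    assert(n != NULL);
    if (n->op == OP_AND || n->op == OP_OR) {
        r = flatten(n->left, out, max);
        if (r < 0)
            return -1;
        used += r;
        r = flatten(n->right, out + used, max - used);
        if (r < 0)
            return -1;
        used += r;
    }
    if (used >= max)
        return -1;
    out[used].op = n->op;
    out[used].arg = n->arg;
    out[used].val = n->val;
    return used + 1;
}

/* Run the stream on a small stack. Returns 1/0, or -1 if it is malformed. */
static int run(const struct insn *prog, int len, const long *args)
{
    int stack[MAX_STACK];
    int sp = 0, i, a, b;

    for (i = 0; i < len; i++) {
        if (prog[i].op == OP_EQ || prog[i].op == OP_NE) {
            if (sp >= MAX_STACK)
                return -1;
            a = (args[prog[i].arg] == prog[i].val);
            stack[sp++] = (prog[i].op == OP_EQ) ? a : !a;
        } else {
            if (sp < 2)
                return -1;
            b = stack[--sp];
            a = stack[--sp];
            stack[sp++] = (prog[i].op == OP_AND) ? (a && b) : (a || b);
        }
    }
    return sp == 1 ? stack[0] : -1;
}
```

Note that this sketch evaluates every leaf and does not short-circuit; a real backend would more likely emit conditional jumps so that the second comparison is skipped when the first already decides the result.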

Thank you,

--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@xxxxxxxxxxx
