Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

From: Steven Rostedt
Date: Thu May 23 2019 - 22:00:44 EST


On Thu, 23 May 2019 17:31:50 -0700
Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:


> > Now from what I'm reading, it seems that the DTrace layer may be
> > abstracting out fields from the kernel. This is actually something I
> > have been thinking about to solve the "tracepoint abi" issue. There's
> > usually basic ideas that happen. An interrupt goes off, there's a
> > handler, etc. We could abstract that out that we trace when an
> > interrupt goes off and the handler happens, and record the vector
> > number, and/or what device it was for. We have tracepoints in the
> > kernel that do this, but they do depend a bit on the implementation.
> > Now, if we could get a layer that abstracts this information away from
> > the implementation, then I think that's a *good* thing.
>
> I don't like this deferred irq idea at all.

What do you mean deferred?

> Abstracting details from the users is _never_ a good idea.

Really? Most everything we do is to abstract details from the user. The
key is to make the abstraction more meaningful than the raw data.

> A ton of people use bcc scripts and bpftrace because they want those details.
> They need to know what kernel is doing to make better decisions.
> Delaying irq record is the opposite.

I never said anything about delaying the record. Just getting the
information that is needed.

> >
> > I wish that was totally true, but tracepoints *can* be an abi. I had
> > code reverted because powertop required one to be a specific
> > format. To this day, the wakeup event has a "success" field that
> > writes in a hardcoded "1", because there's tools that depend on it,
> > and they only work if there's a success field and the value is 1.
>
> I really think that you should put powertop nightmares to rest.
> That was long ago. The kernel is different now.

Is it?

> Linus made it clear several times that it is ok to change _all_
> tracepoints. Period. Some maintainers somehow still don't believe
> that they can do it.

From what I remember him saying several times, you can change
all tracepoints, but if it breaks a tool that is useful, then that
change will get reverted. He will allow you to go and fix that tool and
bring back the change (which was the solution to powertop).

>
> Some tracepoints are used more than others and more people will
> complain: "ohh I need to change my script" when that tracepoint
> changes. But the kernel development is not going to be hampered by a
> tracepoint. No matter how widespread its usage in scripts.

That's because we'll treat bpf (and DTrace) scripts like modules (no
abi), or at least we'd better. But if there's a tool that doesn't use the
script and reads the tracepoint directly via perf, then that's a
different story.
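
To illustrate the "reads the tracepoint directly" case: a tool that
hardcodes field offsets breaks as soon as an event's layout changes,
while one that looks fields up by name in the event's "format"
description can at least notice a missing field and degrade. Below is
a minimal sketch of such a lookup, not how perf or powertop actually
do it. The helper find_field(), struct field_loc, and the
sample_format text (including its offsets) are hypothetical and only
modelled on the usual layout of a tracefs "format" file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical result type: where a named field lives in a raw record. */
struct field_loc {
	int offset;
	int size;
};

/*
 * Illustrative sample only, modelled on a tracefs "format" file.
 * The offsets and sizes here are assumptions, not taken from a kernel.
 */
static const char sample_format[] =
	"name: sched_wakeup\n"
	"format:\n"
	"\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;\n"
	"\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;\n"
	"\n"
	"\tfield:char comm[16];\toffset:8;\tsize:16;\tsigned:1;\n"
	"\tfield:pid_t pid;\toffset:24;\tsize:4;\tsigned:1;\n"
	"\tfield:int prio;\toffset:28;\tsize:4;\tsigned:1;\n"
	"\tfield:int success;\toffset:32;\tsize:4;\tsigned:1;\n"
	"\tfield:int target_cpu;\toffset:36;\tsize:4;\tsigned:1;\n";

/*
 * Hypothetical helper: look up a field by name in "format" text.
 * Returns 0 and fills *loc on success, -1 if the field is absent
 * or its line cannot be parsed.
 */
static int find_field(const char *format, const char *name,
		      struct field_loc *loc)
{
	size_t nlen = strlen(name);
	const char *p = format;

	while ((p = strstr(p, "field:")) != NULL) {
		const char *decl = p + strlen("field:");
		const char *semi = strchr(decl, ';');
		const char *name_start, *name_end;

		if (!semi)
			return -1;
		p = semi;	/* resume the search after this declaration */

		/* Ignore an array suffix such as "[16]". */
		name_end = memchr(decl, '[', (size_t)(semi - decl));
		if (!name_end)
			name_end = semi;

		/* The field name is the last token of the declaration. */
		name_start = name_end;
		while (name_start > decl && name_start[-1] != ' ')
			name_start--;

		if ((size_t)(name_end - name_start) != nlen ||
		    strncmp(name_start, name, nlen) != 0)
			continue;

		if (sscanf(semi, ";\toffset:%d;\tsize:%d;",
			   &loc->offset, &loc->size) == 2)
			return 0;
		return -1;
	}
	return -1;
}
```

A caller would run find_field(sample_format, "success", &loc) and
treat a -1 return as "this kernel does not export that field" instead
of reading garbage from a stale offset. That only softens layout
changes. A tool that depends on the *meaning* of a field still breaks
when the meaning changes.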

-- Steve

>
> Sometimes that pain of change can be mitigated a bit. Like that
> 'success' field example, but tracepoints still change.
> Meaningful value before vs hardcoded constant is still a breakage for
> some scripts.