Re: [PATCH 3/3] trace-cmd: Making stat to report when the stack tracer is ON

From: Steven Rostedt
Date: Wed Nov 29 2017 - 11:18:28 EST


On Wed, 29 Nov 2017 16:00:54 +0200
Vladislav Valtchev <vladislav.valtchev@xxxxxxxxx> wrote:

> On Wed, 2017-11-29 at 07:57 -0500, Steven Rostedt wrote:
> >
> > Let's think about what the user wants.
> >
> > If you do a "trace-cmd stat" what are you looking for? You want to see
> > what ftrace operations are available. Now let's say we do something
> > weird, or someone has some weird modified kernel, and the stack tracer
> > shows something that trace-cmd doesn't expect. With a die, it kills the
> > tool.
> >
> > Would you like it if you ran "trace-cmd stat" and it crashed with
> > an error message saying the kernel is doing something it doesn't
> > understand? To me, I'd be pissed. I would be cursing at trace-cmd
> > saying "I don't give a frick about that, show me what you do know!"
> >
>
> I'm surprised how different kinds of users we are :-)
>
> To me as user, in case there is a kernel bug, the best thing I'd expect
> to see is the tool refusing to work and reporting that it does not really
> know the state of the tracer because of invalid data in tracefs.
>
> In other words, I expect a tool to behave like:
> "I don't know what that is, so I cannot make any decisions.
> Here's the detailed problem (err msg, data). Only a human can help now".
>
> The other approach is instead:
> "I don't know what that is, but I'll make my best guess and try not to piss off the user".

No, I want "I don't know what this is (tell user about it) and carry
on."

The point being, trace-cmd stat does a lot more than check if the stack
tracer is on. If it can't figure that out, it should warn that it got
confused about it, but it should still report about all the other
tracing that it does know about.
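
A minimal sketch of that "warn and carry on" behavior, in C. This is
not the actual trace-cmd code: the enum, the function names, and the
message wording are all made up for illustration. It only assumes that
the stack tracer state is read as text and is normally "0" or "1".

```c
#include <stdio.h>

/* Hypothetical result codes for this sketch; not trace-cmd's real API. */
enum stack_state {
	STACK_UNKNOWN = -1,
	STACK_OFF = 0,
	STACK_ON = 1,
};

/*
 * Interpret the text read from the stack tracer control file.
 * Anything other than "0" or "1" (optionally followed by a newline)
 * produces a warning on stderr and STACK_UNKNOWN. It never exits.
 */
enum stack_state parse_stack_tracer(const char *buf)
{
	if (buf && (buf[0] == '0' || buf[0] == '1') &&
	    (buf[1] == '\n' || buf[1] == '\0'))
		return buf[0] == '1' ? STACK_ON : STACK_OFF;

	fprintf(stderr, "warning: unexpected stack tracer value '%s'\n",
		buf ? buf : "(null)");
	return STACK_UNKNOWN;
}

/*
 * Report the stack tracer part of the status to 'out'.
 * Returns 1 if a status line was printed, 0 otherwise. The caller
 * goes on to report everything else in either case.
 */
int report_stack_tracer(FILE *out, const char *buf)
{
	if (parse_stack_tracer(buf) == STACK_ON) {
		fprintf(out, "Stack tracing is enabled\n");
		return 1;
	}
	/* Off or unknown: nothing to print here, keep going. */
	return 0;
}
```

The unknown case is handled by the same caller path as the "off" case,
so one confusing value cannot hide the rest of the report.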

And who said there was a bug? It could be a modified kernel that was
done on purpose. Why should that kill trace-cmd?


>
> Both approaches have PROs and CONs. It is evident that, in the first case the tool is
> pedantic and won't even try to do anything. In the 2nd case instead, the tool
> might guess even correctly at first but, by not exposing the underlying issue, it might
> fail (if there's a mem corruption at kernel-level, likely it will) at any moment in a
> strange way. In any case, the user will be pissed. Just, in the first case he/she will
> benefit from an "early-stage" error, that might make the problem easier to find.
> Also, with the 2nd approach, the user won't figure out immediately that the tool is not
> guilty, while the kernel is: nobody should blame the poor tool when it had no chance
> to get its job done in the first place.

It should warn, and continue. It shouldn't die. Warning lets the user
know that there was some kind of anomaly that trace-cmd doesn't know
of, and the user can investigate further if they want to. Or the user
could say "oh yeah, I modified this kernel, or I have an out of date
trace-cmd, no problem, it still gives me the information I'm looking
for."

I see no CON with my approach, but I see many with yours.


>
> > Now, do you think having a "die" is good there?
>
> I prefer the "fail-early" approach in general. For a tool like trace-cmd,
> I'd implement a layer validating all the input with an option for controlling
> validation in hot paths. BUT, since that is not the philosophy of the tool,
> adding a check like that only there, does not make much sense.
>
> It makes sense to take an approach and consistently follow it.
> So, I'll fix my patch as you suggested.
>
>
> I hope I'm not "pissing you off" with my long comments :-)

Nope not at all :-)

I'm just trying to educate you. Please note, the kernel itself does the
same thing. And Linus has yelled at people for using BUG_ON() instead
of WARN_ON(). He says, don't crash my kernel just because your code
screwed up!

ftrace itself has lots of self checks. It will shut itself down if it
finds an anomaly, but it doesn't crash the kernel. There's one
exception, and that's when it gets into a code path during function
graph tracing where there's no place to return to. That happened once,
and was due to a bug in gcc that made every call traced by function
graph tracing fail to return properly. There was no recovery. But
that's the exception and not the rule.
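
As a rough userspace illustration of "shut itself down but don't
crash", here is a sketch in C. It is not the kernel's ftrace code. The
flag, the function names, and the "negative value" anomaly are all
invented for the example. The idea is only that a failed self check
warns once, disables the tracer, and lets the rest of the program keep
running.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical global kill switch for the tracer in this sketch. */
static bool tracer_disabled;

/*
 * Self check: if 'anomaly' is true, warn and disable the tracer.
 * The warning is printed only on the first anomaly. Returns 'anomaly'
 * so callers can bail out of the current operation.
 */
bool tracer_check(bool anomaly, const char *what)
{
	if (anomaly && !tracer_disabled) {
		fprintf(stderr,
			"warning: tracer anomaly (%s), disabling tracer\n",
			what);
		tracer_disabled = true;
	}
	return anomaly;
}

/*
 * Record one event. Returns 0 on success, -1 if the tracer is (or
 * just became) disabled. It never aborts the process.
 */
int trace_event(int value)
{
	if (tracer_disabled)
		return -1;
	if (tracer_check(value < 0, "negative value"))
		return -1;
	/* A real tracer would record 'value' here. */
	return 0;
}
```

After the first anomaly every later trace_event() call returns -1, but
the caller is still alive and can do whatever else it was doing.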

-- Steve