Re: [RFC][PATCH 0/4] tracing: Add new hwlat_detector tracer

From: Steven Rostedt
Date: Thu Apr 23 2015 - 19:23:47 EST


On Thu, 23 Apr 2015 15:50:29 -0700
Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Thu, Apr 23, 2015 at 1:21 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >
> > But at least on the machines which have the event counter it would be
> > useful to include that information as well.
>
> In fact, I'd argue that we should *not* do this odd magic latency
> measurement thing at all, exactly because Intel gave us the SMI
> counter, and it's much more likely to be useful in real life. The odd
> "stop machine and busy loop and do magic" thing is an incredibly
> invasive hack that sane people will never enable at all, while the

No sane person should enable this on any production machine, nor
should they use the other latency tracers (irqsoff and friends). But we
have used this odd magic latency measurement in dealings with vendors
and such when certifying their machines. So this is not something
that we just wanted to throw into the kernel for fun. This tool has
actually been very helpful to us.

> "add support for the hardware we asked for and got" is a small thing
> that we can do on all modern Intel chips, and can be enabled by
> default because there is no downside.
>

What about a mix and match tracer? If the hardware supports SMI
counters (and more importantly, SMI cycle counters), it will just use
that, otherwise if the hardware does not support the SMI counting, it
falls back to the odd magic latency measurement thing.
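
As a rough illustration of the counter side, here is a minimal
user-space sketch of reading the SMI count. It assumes an Intel CPU
that implements MSR_SMI_COUNT (0x34), the msr driver loaded so that
/dev/cpu/N/msr exists, and enough privilege to open it.
read_smi_count() is a hypothetical helper name, not something from the
patch set, and the sketch does not cover SMI cycle counters.

```c
#define _XOPEN_SOURCE 700   /* for pread() */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define MSR_SMI_COUNT 0x34  /* Intel SMI counter MSR */

/*
 * Hypothetical helper: read the raw MSR_SMI_COUNT value through the
 * msr character device (e.g. "/dev/cpu/0/msr"). The msr driver uses
 * the file offset as the MSR number.
 * Returns 0 on success, -1 if the device cannot be opened or read.
 */
static int read_smi_count(const char *msr_path, uint64_t *count)
{
    int fd = open(msr_path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, count, sizeof(*count), MSR_SMI_COUNT);
    close(fd);

    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

Calling this before and after a workload and comparing the two values
would show whether any SMIs fired in between. A failure return is what
would trigger the fallback described above.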

I could even make the odd magic latency measurement thing only be
enabled via a tracer flag. That way the tracer is safe to use by
default: it uses the SMI counter if supported, and if it isn't
supported, a tracing message will display info about the more invasive
(not to be used in a production environment) measurement. The more
invasive version will only be activated if the user explicitly sets
the flag, even when SMI counters are not available.
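
The selection logic I have in mind could be sketched roughly as below.
This is only an illustration: the enum, the function name and the two
booleans are made up for this mail, and having the counter win when
both the counter and the flag are present is an assumption, since I
have not decided that case.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum hwlat_mode {
    HWLAT_SMI_COUNTER,   /* non-invasive: read the hardware SMI counter */
    HWLAT_BUSY_LOOP,     /* invasive: the "odd magic" latency measurement */
    HWLAT_MESSAGE_ONLY,  /* do nothing but tell the user about the flag */
};

/*
 * Pick a measurement mode.
 *  - Counter available: use it (assumed to win even if the flag is set).
 *  - No counter, flag set: the user explicitly asked for the invasive loop.
 *  - No counter, no flag: only print a message about the invasive option.
 */
static enum hwlat_mode hwlat_pick_mode(bool has_smi_counter,
                                       bool invasive_flag)
{
    if (has_smi_counter)
        return HWLAT_SMI_COUNTER;
    if (invasive_flag)
        return HWLAT_BUSY_LOOP;
    return HWLAT_MESSAGE_ONLY;
}
```

The point of the third state is that the invasive loop never starts on
its own; a missing counter alone only produces the message.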

When this was first proposed, it was called smi_detector, and I
believe it was Andrew who suggested renaming it to hwlat_detector,
because it could, theoretically, detect other types of hardware issues
that stop the CPU, in case something like that ever arises.

-- Steve