Re: [RFC PATCH 27/30] Code tagging based latency tracking

From: Kent Overstreet
Date: Thu Sep 01 2022 - 17:55:00 EST


On Thu, Sep 01, 2022 at 05:38:44PM -0400, Steven Rostedt wrote:
> On Tue, 30 Aug 2022 14:49:16 -0700
> Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> > From: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> >
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following to calls to your code, at the start and end of
> > the event you wish to measure:
> >
> > code_tag_time_stats_start(start_time);
> > code_tag_time_stats_finish(start_time);
>
> So you need to modify the code to see what you want?

Figuring out the _correct_ place to measure is often a significant amount of the
total effort.

Having done so once, why not annotate that in the source code?

> For function length you could just do something like this:
>
> # cd /sys/kernel/tracing
> # echo __skb_wait_for_more_packets > set_ftrace_filter
> # echo 1 > function_profile_enabled
> # cat trace_stat/function*
> Function Hit Time Avg s^2
> -------- --- ---- --- ---
> __skb_wait_for_more_packets 1 0.000 us 0.000 us 0.000 us
> Function Hit Time Avg s^2
> -------- --- ---- --- ---
> __skb_wait_for_more_packets 1 74.813 us 74.813 us 0.000 us
> Function Hit Time Avg s^2
> -------- --- ---- --- ---
> Function Hit Time Avg s^2
> -------- --- ---- --- ---
>
> The above is for a 4 CPU machine. The s^2 is the square of the standard
> deviation (makes not having to do divisions while it runs).
>
> But if you are looking for latency between two events (which can be kprobes
> too, where you do not need to rebuild your kernel):
>
> From: https://man.archlinux.org/man/sqlhist.1.en
> which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
> if not already installed on your distro.
>
> # sqlhist -e -n wakeup_lat 'select end.next_comm as comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from sched_waking as start join sched_switch as end on start.pid = end.next_pid where start.prio < 100'
>
> The above creates a synthetic event called "wakeup_lat" that joins two
> events (sched_waking and sched_switch) when the pid field of sched_waking
> matches the next_pid field of sched_switch. When there is a match, it will
> trigger the wakeup_lat event only if the prio of the sched_waking event is
> less than 100 (which in the kernel means any real-time task). The
> wakeup_lat event will record the next_comm (as comm field), the pid of
> woken task and the time delta in microseconds between the two events.

So this looks like it's gotten better since I last looked, but it's still not
there yet.

Part of the problem is that the tracepoints themselves are in the wrong place:
your end event is when a task is woken up, but that means spurious wakeups will
cause one wait_event() call to be reported as multiple smaller waits, not one
long wait - oops, now I can't actually find the thing that's causing my
multi-second delay.

Also, in your example you don't have it broken out by callsite. That would be
the first thing I'd need for any real world debugging.

So, it looks like tracing has made some progress over the past 10 years, but
for debugging latency issues it's still not there yet in general. I will
definitely remember function latency tracing the next time I'm doing performance
work, but I expect that to be far too heavy to enable on a live server.

This thing is only a couple hundred lines of code though, so perhaps tracing
shouldn't be the only tool in our toolbox :)