Re: [PATCH 3/3] x86/ftrace: Use text_poke()

From: Daniel Bristot de Oliveira
Date: Fri Oct 04 2019 - 04:10:57 EST


On 04/10/2019 00:10, Steven Rostedt wrote:
> On Wed, 2 Oct 2019 20:21:06 +0200
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
>> On Wed, Oct 02, 2019 at 06:35:26PM +0200, Daniel Bristot de Oliveira wrote:
>>
>>> ftrace was already batching the updates, for instance, causing 3 IPIs to enable
>>> all functions. The text_poke() batching also works. But because of the limited
>>> buffer [ see the reply to the patch 2/3 ], it is flushing the buffer during the
>>> operation, causing more IPIs than the previous code. Using the 5.4-rc1 in a VM,
>>> when enabling the function tracer, I see 250+ intermediate text_poke_finish()
>>> because of a full buffer...
>>>
>>> Would this be the case of trying to use a dynamically allocated buffer?
>>>
>>> Thoughts?
>>
>> Is it a problem? I tried growing the buffer (IIRC I made it 10 times
>> bigger) and didn't see any performance improvements because of it.
>
> I'm just worried if people are going to complain about the IPI burst.
> Although, I just tried it out before applying this patch, and there's
> still a bit of a burst. Not sure why. I did:
>
> # cat /proc/interrupts > /tmp/before; echo function > /debug/tracing/current_tracer; cat /proc/interrupts > /tmp/after
> # cat /proc/interrupts > /tmp/before1; echo nop > /debug/tracing/current_tracer; cat /proc/interrupts > /tmp/after1
>
> Before this patch:
>
> # diff /tmp/before /tmp/after
> < CAL: 2342 2347 2116 2175 2446 2030 2416 2222 Function call interrupts
> ---
>> CAL: 2462 2467 2236 2295 2446 2150 2536 2342 Function call interrupts
>
> (Just showing the function call interrupts)
>
> There appears to be 120 IPIs sent to all CPUS for enabling function tracer.
>
> # diff /tmp/before1 /tmp/after1
> < CAL: 2462 2467 2236 2295 2446 2150 2536 2342 Function call interrupts
> ---
>> CAL: 2577 2582 2351 2410 2446 2265 2651 2457 Function call interrupts
>
> And 151 IPIs for disabling it.
>
> After applying this patch:
>
> # diff /tmp/before /tmp/after
> < CAL: 66070 46620 59955 59236 68707 63397 61644 62742 Function call interrupts
> ---
>> CAL: 66727 47277 59955 59893 69364 64054 62301 63399 Function call interrupts
>
> # diff /tmp/before1 /tmp/after1
> < CAL: 66727 47277 59955 59893 69364 64054 62301 63399 Function call interrupts
> ---
>> CAL: 67358 47938 59985 60554 70025 64715 62962 64060 Function call interrupts
>
>
> We get 657 IPIs for enabling function tracer, and 661 for disabling it.
> Funny how it's more on the disable than the enable with the patch but
> the other way without it.
>
> But still, we are going from 120 to 660 IPIs for every CPU. Not saying
> it's a problem, but something that we should note. Someone (those that
> don't like kernel interference) may complain.

That is the point I was raising.

When enabling ftrace, we have three different paths:

1) the enabling/disabling ftrace path
2) the int3 path - if a thread/irq is running a kernel function
3) the IPI - that affects all CPUs, even those that are not "hitting" trace
code, e.g., user-space.

The first one is for sure a cold-path. The second one is a hot-path: any task
running kernel functions will hit it. But IMHO, the hottest one is the IPIs,
because it will run on all CPUs, e.g., even isolated CPUs that are running in
user-space.

Currently, ftrace does:

for_ftrace_rec:
Install all breakpoints
send IPI

for_ftrace_rec:
write the end of the instruction
send IPI

for_ftrace_rec:
write the first byte of the instruction
send IPI

And that is the same thing we do with the batch mode, and so it would be better
to integrate both.

The problem is that considering the patch 2/3, the amount of entries we can
batch in the text_poke is limited, and so we batch on chunks of TP_VEC_MAX
entries. So, rather than doing 3 IPIs to change the code, we do:

(ftrace_rec count/TP_VEC_MAX) * 3 IPIs.

One possible solution for this would be to allocate a buffer with "number of
ftrace_rec" and use it in the text_poke batch mode.

But to do it, we should keep the old interface (the one that the 2/3 is
changing). Well, at least using a per-use-case buffer.

[ In addition ]

Currently, ftrace_rec entries are ordered inside the group of functions, but
"groups of function" are not ordered. So, the current int3 handler does a (*):

for_each_group_of_functions:
check if the ip is in the range ----> n by the number of groups.
do a bsearch. ----> log(n) by the numbers of entry
in the group.

If, instead, it uses an ordered vector, the complexity would be log(n) by the
total number of entries, which is better. So, how bad is the idea of:

in the enabling ftrace code path, it:
discover the number of entries
alloc a buffer
discover the order of the groups
for each group in the correct order
queue the entry in the buffer
apply the changes using the text_poke...

In this way we would optimize the two hot-paths:
int3 will be log(n)
IPIs bounded to 3.

I am not saying we need to do it now, as Steve said, not sure if this is a big
problem, but... those that don't like kernel interference may complain. But if
we leave the per-use-case vector in the text_poke_batch interface, things will
be easier to fix.

NOTE: the other IPIs are generated by hooking the tracepoints and switching the
code to RO/RW...

* as far as I understood ftrace_location_range().

-- Daniel

> -- Steve
>