Re: Ftrace vs perf user page fault statistics differences

From: Steven Rostedt
Date: Tue Jun 13 2017 - 16:04:41 EST


On Tue, 13 Jun 2017 14:02:08 -0400
Will Hawkins <whh8b@xxxxxxxxxxxx> wrote:

> Thank you for pointing this out. I had been using -F for exactly the
> reason that you mentioned. I failed to include it in the command that
> I sent along. Very sorry for the confusion. Here is an updated version
> of the command that I issued:
>
>
> sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
> handle_mm_fault -F ../one_page_play/page
>
> and I generated output like
>
> ./trace-cmd report --profile
>
> and I see the following (among some other output):
>
> Event: page_fault_user:0x7f094f7dd104 (1)
> Event: page_fault_user:0x4000e0 (1)
> Event: page_fault_user:0x7f094f7eae4a (1)
> Event: page_fault_user:0x7f094f860d40 (1)
> Event: page_fault_user:0x7f094f7db560 (1)
> Event: page_fault_user:0x4040cb (1)
> Event: page_fault_user:0x401825 (1)
> Event: page_fault_user:0x401473 (1)
> Event: page_fault_user:0x7f094f7e64c4 (1)
> Event: page_fault_user:0x7f094f7f1212 (1)
>
> That output comes from under the task: page-<pid> heading, so it seems
> like those faults are being attributed to the page task.
>
> This command seems to show something interesting:
>
> sudo ./trace-cmd record -e exceptions:page_fault_user -p
> function_graph -g __do_fault -F ../one_page_play/page
>
> and the relevant output from
>
> ./trace-cmd report --profile
>
> is
>
> task: page-4032
> Event: func: __do_fault() (4) Total: 6685 Avg: 1671 Max:
> 2398(ts:170150.060916) Min:855(ts:170150.054713)
> Event: page_fault_user:0x7ffad3143d40 (1)
> Event: page_fault_user:0x4000e0 (1)
> Event: page_fault_user:0x401473 (1)
> Event: page_fault_user:0x7ffad30c94c4 (1)
>
> This is closer to what I would expect. The first of the two 0x4...
> addresses is the entry point and the second is the target. Basically,
> that is exactly what I expect. The other two are the "suspicious"
> entries. Neither matches the copy_user_enhanced_fast_string symbol
> location and are not loaded in the binary (according to gdb).

As you state below, there is faults recorded before the exec. Which is
true with trace-cmd (not sure about perf). As trace-cmd does do some
work after tracing is started and before the exec is called.

>
> It is odd to me that the output from the previous command includes
> information about the trace-cmd process since I specified the -F
> option.

Did the trace-cmd process have the same pid as the page program? The -F
makes it follow the pid, which can be trace-cmd before the fork.

Oh, and what versions are you running of the kernel as well as
trace-cmd. It's best to always get the latest trace-cmd. Older versions
did have some bugs.

Also, function_graph can have some residual traces from a sched_switch.

That is, the tracing of a task specified by -F is enabled and disabled
per cpu at sched_switch. The sched_switch tracepoint is before the
actual context switch. When the next task is the task specified by -F,
tracing begins. But the context switch can start tracing the previous
task right after the sched switch tracepoint. Here, in
kernel/sched/core.c in __schedule():

trace_sched_switch(preempt, prev, next);

/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);


The context_switch() function will be traced even when the previous
task isn't the task from -F.


>
> But, back to exactly what you asked. Here is the result of running
> perf again with the -a option. I ran this command:
>
> sudo perf record --call-graph fp -e page-faults -a ../one_page_play/page
>
> 50.41% page ld-2.19.so [.] do_lookup_x
> |
> --- do_lookup_x
>
> 44.21% perf [kernel.kallsyms] [k] iov_iter_fault_in_readable
> |
> --- iov_iter_fault_in_readable
> generic_file_buffered_write
> __generic_file_aio_write
> generic_file_aio_write
> ext4_file_write
> do_sync_write
> vfs_write
> sys_write
> system_call_fastpath
> __write_nocancel
> 0x4081a5
> 0x407a40
> __libc_start_main
>
> 4.13% perf perf [.] 0x0000000000015b54
> |
> --- 0x415b54
> 0x4081a5
> 0x407a40
> __libc_start_main
>
> 0.41% page page [.] _start
> |
> --- _start
>
> 0.41% page page [.] target
> |
> --- target
>
> 0.41% page [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> |
> --- copy_user_enhanced_fast_string
> load_elf_binary
> search_binary_handler
> do_execve_common.isra.23
> sys_execve
> stub_execve
> __execve
>
>
> What is interesting, is that the output differs based on whether I've
> dropped the kernel caches before I run the perf record. When I do
> that, there are no page faults attributed to the entry point or the
> target of the program. I would imagine that, after dropping caches,
> the readahead handler picks up those pages when the binary is loaded
> and negates the need for a page fault. That, indeed, seems to be the
> case. I can see that when I run perf with an additional
> block:block_rq_issue event. Immediately after dropping the caches,
> there is a block request event. On subsequent executions, there is no
> such event but there are the page faults that I expect.
>
> What I did notice is that some of the page faults come from the task
> before execve is called. From what I've seen, this looks like the
> kernel reclaiming pages from the spawning process before it is
> replaced with the new binary (during the call to execve). After the
> execve, there seem to be to page faults:
>
>
> page-4613 [006] 171795.748310: funcgraph_entry:
> 0.151 us | mutex_unlock();
> page-4613 [006] 171795.748313: funcgraph_entry:
> 0.166 us | __fsnotify_parent();
> page-4613 [006] 171795.748313: funcgraph_entry:
> 0.321 us | fsnotify();
> page-4613 [006] 171795.748314: funcgraph_entry:
> 0.090 us | __sb_end_write();
> page-4613 [006] 171795.748317: funcgraph_entry:
> | trace_do_page_fault() {
> page-4613 [006] 171795.748317: page_fault_user:
> address=__per_cpu_end ip=__per_cpu_end error_code=0x4
> page-4613 [006] 171795.748319: funcgraph_exit:
> 2.254 us | }
> page-4613 [006] 171795.748321: funcgraph_entry:
> | trace_do_page_fault() {
> page-4613 [006] 171795.748322: page_fault_user:
> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
> page-4613 [006] 171795.748323: funcgraph_exit:
> 1.144 us | }
>
> NOTICE THIS:
> page-4613 [006] 171795.748324: funcgraph_entry:
> | sys_execve() {
> page-4613 [007] 171795.748391: block_rq_issue: 8,0
> R 0 () 764812912 + 16 [trace-cmd]
> page-4613 [005] 171795.759476: funcgraph_exit: #
> 11152.111 us | }
> page-4613 [005] 171795.759477: funcgraph_entry:
> 3.745 us | do_notify_resume();
> page-4613 [005] 171795.759481: funcgraph_entry:
> | trace_do_page_fault() {
> page-4613 [005] 171795.759482: page_fault_user:
> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
> page-4613 [005] 171795.759487: funcgraph_exit:
> 5.833 us | }
> page-4613 [005] 171795.759488: funcgraph_entry:
> | trace_do_page_fault() {
> page-4613 [005] 171795.759489: page_fault_user:
> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
> page-4613 [005] 171795.759490: funcgraph_exit:
> 2.003 us | }
> page-4613 [005] 171795.759492: funcgraph_entry:
> | sys_exit() {
>
> I wish that I could get something "better" than "
> address=__per_cpu_end ip=__per_cpu_end error_code=0x14" for those page
> faults. That would really tell me more about whether this is the
> "correct" behavior.

Also, you might want to do a trace-cmd report -l, which lets you see if
the event happened in interrupt context.

>
> As ever, thank you very much for your help! Using these tools has been
> an incredible learning experience. I still think that I am just
> missing something stupid, but I really appreciate your patience.

No no, you got me curious as well. Although, the only page faults I see
are from before the exec and the actual program. Could you run the -F
on your program with the other options you have and report a full:

trace-cmd report -l

And then you can ask about what you don't understand being there.

-- Steve


-- Steve