Re: [LKP] [SUNRPC] 0472e47660: fsmark.app_overhead 16.0% regression

From: Xing Zhengjun
Date: Wed Aug 07 2019 - 03:56:31 EST




On 7/24/2019 1:17 PM, Xing Zhengjun wrote:


On 7/12/2019 2:42 PM, Xing Zhengjun wrote:
Hi Trond,

    I have attached the perf-profile part with the big changes; I hope it is useful for analyzing the issue.

Ping...

ping...




In testcase: fsmark
on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 384G memory
with the following parameters:

        iterations: 20x
        nr_threads: 64t
        disk: 1BRD_48G
        fs: xfs
        fs2: nfsv4
        filesize: 4M
        test_size: 80G
        sync_method: fsyncBeforeClose
        cpufreq_governor: performance

test-description: fsmark is a file system benchmark that tests synchronous write workloads, for example a mail server workload.
test-url: https://sourceforge.net/projects/fsmark/

commit:
  e791f8e938 ("SUNRPC: Convert xs_send_kvec() to use iov_iter_kvec()")
  0472e47660 ("SUNRPC: Convert socket page send code to use iov_iter()")

e791f8e9380d945e 0472e476604998c127f3c80d291
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
    527.29           -22.6%     407.96        fsmark.files_per_sec
      1.97 ± 11%      +0.9        2.88 ±  4%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.cpuidle_enter_state.do_idle.cpu_startup_entry
      0.00            +0.9        0.93 ±  4%  perf-profile.calltrace.cycles-pp.tcp_write_xmit.tcp_sendmsg_locked.tcp_sendmsg.sock_sendmsg.xs_sendpages
      2.11 ± 10%      +0.9        3.05 ±  4%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary
      5.29 ±  2%      +1.2        6.46 ±  7%  perf-profile.calltrace.cycles-pp.svc_recv.nfsd.kthread.ret_from_fork
      9.61 ±  5%      +3.1       12.70 ±  2%  perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork
      9.27 ±  5%      +3.1       12.40 ±  2%  perf-profile.calltrace.cycles-pp.process_one_work.worker_thread.kthread.ret_from_fork
     34.52 ±  4%      +3.3       37.78 ±  2%  perf-profile.calltrace.cycles-pp.ret_from_fork
     34.52 ±  4%      +3.3       37.78 ±  2%  perf-profile.calltrace.cycles-pp.kthread.ret_from_fork
      0.00            +3.4        3.41 ±  4%  perf-profile.calltrace.cycles-pp.memcpy_erms.memcpy_from_page._copy_from_iter_full.tcp_sendmsg_locked.tcp_sendmsg
      0.00            +3.4        3.44 ±  4%  perf-profile.calltrace.cycles-pp.memcpy_from_page._copy_from_iter_full.tcp_sendmsg_locked.tcp_sendmsg.sock_sendmsg
      0.00            +3.5        3.54 ±  4%  perf-profile.calltrace.cycles-pp._copy_from_iter_full.tcp_sendmsg_locked.tcp_sendmsg.sock_sendmsg.xs_sendpages
      2.30 ±  5%      +3.7        6.02 ±  3%  perf-profile.calltrace.cycles-pp.__rpc_execute.rpc_async_schedule.process_one_work.worker_thread.kthread
      2.30 ±  5%      +3.7        6.02 ±  3%  perf-profile.calltrace.cycles-pp.rpc_async_schedule.process_one_work.worker_thread.kthread.ret_from_fork
      1.81 ±  4%      +3.8        5.59 ±  4%  perf-profile.calltrace.cycles-pp.call_transmit.__rpc_execute.rpc_async_schedule.process_one_work.worker_thread
      1.80 ±  3%      +3.8        5.59 ±  3%  perf-profile.calltrace.cycles-pp.xprt_transmit.call_transmit.__rpc_execute.rpc_async_schedule.process_one_work
      1.73 ±  4%      +3.8        5.54 ±  4%  perf-profile.calltrace.cycles-pp.xs_tcp_send_request.xprt_transmit.call_transmit.__rpc_execute.rpc_async_schedule
      1.72 ±  4%      +3.8        5.54 ±  4%  perf-profile.calltrace.cycles-pp.xs_sendpages.xs_tcp_send_request.xprt_transmit.call_transmit.__rpc_execute
      0.00            +5.4        5.42 ±  4%  perf-profile.calltrace.cycles-pp.tcp_sendmsg_locked.tcp_sendmsg.sock_sendmsg.xs_sendpages.xs_tcp_send_request
      0.00            +5.5        5.52 ±  4%  perf-profile.calltrace.cycles-pp.tcp_sendmsg.sock_sendmsg.xs_sendpages.xs_tcp_send_request.xprt_transmit
      0.00            +5.5        5.53 ±  4%  perf-profile.calltrace.cycles-pp.sock_sendmsg.xs_sendpages.xs_tcp_send_request.xprt_transmit.call_transmit
      9.61 ±  5%      +3.1       12.70 ±  2%  perf-profile.children.cycles-pp.worker_thread
      9.27 ±  5%      +3.1       12.40 ±  2%  perf-profile.children.cycles-pp.process_one_work
      6.19            +3.2        9.40 ±  4%  perf-profile.children.cycles-pp.memcpy_erms
     34.53 ±  4%      +3.3       37.78 ±  2%  perf-profile.children.cycles-pp.ret_from_fork
     34.52 ±  4%      +3.3       37.78 ±  2%  perf-profile.children.cycles-pp.kthread
      0.00            +3.5        3.46 ±  4%  perf-profile.children.cycles-pp.memcpy_from_page
      0.00            +3.6        3.56 ±  4%  perf-profile.children.cycles-pp._copy_from_iter_full
      2.47 ±  4%      +3.7        6.18 ±  3%  perf-profile.children.cycles-pp.__rpc_execute
      2.30 ±  5%      +3.7        6.02 ±  3%  perf-profile.children.cycles-pp.rpc_async_schedule
      1.90 ±  4%      +3.8        5.67 ±  3%  perf-profile.children.cycles-pp.call_transmit
      1.89 ±  3%      +3.8        5.66 ±  3%  perf-profile.children.cycles-pp.xprt_transmit
      1.82 ±  4%      +3.8        5.62 ±  3%  perf-profile.children.cycles-pp.xs_tcp_send_request
      1.81 ±  4%      +3.8        5.62 ±  3%  perf-profile.children.cycles-pp.xs_sendpages
      0.21 ± 17%      +5.3        5.48 ±  4%  perf-profile.children.cycles-pp.tcp_sendmsg_locked
      0.25 ± 18%      +5.3        5.59 ±  3%  perf-profile.children.cycles-pp.tcp_sendmsg
      0.26 ± 16%      +5.3        5.60 ±  3%  perf-profile.children.cycles-pp.sock_sendmsg
      1.19 ±  5%      +0.5        1.68 ±  3%  perf-profile.self.cycles-pp.get_page_from_freelist
      6.10            +3.2        9.27 ±  4%  perf-profile.self.cycles-pp.memcpy_erms
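
For reference, the new hot spots that only appear with 0472e47660 (_copy_from_iter_full / memcpy_from_page / memcpy_erms under tcp_sendmsg_locked) look like the data copy one would expect once the pages are described by an iov_iter and pushed through sock_sendmsg() rather than handed to the socket page by page. The sketch below is only an illustration of those two send styles under that assumption; the helper names are invented and this is not the actual xs_sendpages()/xs_tcp_send_request() code from the commits:

/*
 * Illustrative kernel-style sketch only, not the SUNRPC code from
 * commits e791f8e938 / 0472e47660.  Helper names are made up.
 */
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>
#include <linux/bvec.h>

/*
 * Old style: hand each page to the socket with kernel_sendpage().
 * TCP can reference the page directly, so no bulk data copy shows
 * up in the profile.
 */
static int send_page_old_style(struct socket *sock, struct page *page,
                               int offset, size_t len, int flags)
{
        return kernel_sendpage(sock, page, offset, len, flags);
}

/*
 * New style: describe the pages with a bio_vec-backed iov_iter and
 * call sock_sendmsg().  If the transport does not take a zero-copy
 * path, tcp_sendmsg_locked() pulls the data in via
 * _copy_from_iter_full(), which would account for the
 * memcpy_from_page/memcpy_erms entries above.
 */
static int send_pages_new_style(struct socket *sock, struct bio_vec *bvec,
                                unsigned int nr_pages, size_t len)
{
        struct msghdr msg = { .msg_flags = MSG_DONTWAIT };

        iov_iter_bvec(&msg.msg_iter, WRITE, bvec, nr_pages, len);
        return sock_sendmsg(sock, &msg);
}

Whether the copy actually happens depends on the path taken inside tcp_sendmsg_locked(); the sketch is only meant to show where the extra memcpy_erms cycles plausibly come from.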


On 7/9/2019 10:39 AM, Xing Zhengjun wrote:
Hi Trond,

On 7/8/2019 7:44 PM, Trond Myklebust wrote:
I've asked several times now about how to interpret your results. As far as I can tell from your numbers, the overhead appears to be entirely contained in the NUMA section of your results.
IOW: it would appear to be a scheduling overhead due to NUMA. I've been asking whether or not that is a correct interpretation of the numbers you published.
Thanks for your feedback. I used the same hardware and the same test parameters to test the two commits:
   e791f8e938 ("SUNRPC: Convert xs_send_kvec() to use iov_iter_kvec()")
   0472e47660 ("SUNRPC: Convert socket page send code to use iov_iter()")

If it were caused by NUMA, why does only commit 0472e47660 show the throughput decrease? The filesystem we test is NFS, and commit 0472e47660 touches the network send path; could you help check whether there are any other clues for the regression? Thanks.




--
Zhengjun Xing