Re: [lkp] [tcp] 3613b3dbd1: netpipe.bigger_5M_Mbps.avg -5.5% regression

From: Eric Dumazet
Date: Wed Oct 19 2016 - 01:11:16 EST


On Wed, Oct 19, 2016 at 3:54 AM, kernel test robot
<xiaolong.ye@xxxxxxxxx> wrote:
>
> FYI, we noticed a -5.5% regression of netpipe.bigger_5M_Mbps.avg due to commit:
>
> commit 3613b3dbd1ade9a6a626dae1f608c57638eb5e8a ("tcp: prepare skbs for better sack shifting")
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>
> in testcase: netpipe
> on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 6G
> with following parameters:
>
> test: tcp
> cluster: cs-localhost
>
> NetPIPE is a protocol independent performance tool that encapsulates the best of ttcp and netperf and visually represents the network performance under a variety of conditions.
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml # job file is attached in this email
> bin/lkp run job.yaml
>
> =========================================================================================
> cluster/compiler/kconfig/rootfs/tbox_group/test/testcase:
> cs-localhost/gcc-6/x86_64-rhel-7.2/debian-x86_64-2016-08-31.cgz/vm-lkp-a08/tcp/netpipe
>
> commit:
> e812bd905a ("wireless-drivers-next patches for 4.9")
> 3613b3dbd1 ("tcp: prepare skbs for better sack shifting")
>
> e812bd905a5cf00f 3613b3dbd1ade9a6a626dae1f6
> ---------------- --------------------------
> fail:runs %reproduction fail:runs
> | | |
> %stddev %change %stddev
> \ | \
> 2427 ± 0% -5.5% 2294 ± 0% netpipe.bigger_5M_Mbps.avg
> 1764 ± 1% +5.0% 1853 ± 0% netpipe.time.elapsed_time
> 1764 ± 1% +5.0% 1853 ± 0% netpipe.time.elapsed_time.max
> 84.75 ± 0% -2.1% 83.00 ± 0% netpipe.time.percent_of_cpu_this_job_got
> 1497 ± 1% +3.2% 1546 ± 0% netpipe.time.system_time
> 1378335 ± 1% +28.6% 1772455 ± 1% netpipe.time.voluntary_context_switches
> 14140 ± 3% +6.6% 15079 ± 2% interrupts.CAL:Function_call_interrupts
> 393.25 ± 2% -29.1% 279.00 ± 2% slabinfo.kmalloc-2048.active_objs
> 412.00 ± 1% -32.3% 279.00 ± 2% slabinfo.kmalloc-2048.num_objs
> 4069 ± 0% +17.5% 4779 ± 0% vmstat.system.cs
> 2961 ± 0% +5.3% 3118 ± 0% vmstat.system.in
> 229439 ± 94% +6926.4% 16121408 ±172% latency_stats.avg.max
> 141971 ±160% +11239.9% 16099418 ±172% latency_stats.avg.nfs_wait_on_request.nfs_writepage_setup.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
> 90183 ±153% -55.4% 40178 ± 84% latency_stats.avg.proc_cgroup_show.proc_single_show.seq_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
> 839.25 ± 21% +1114.0% 10188 ±125% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_execute.rpc_run_task.nfs4_call_sync_sequence.[nfsv4]._nfs4_proc_lookup.[nfsv4].nfs4_proc_lookup_common.[nfsv4].nfs4_proc_lookup.[nfsv4].nfs_lookup.lookup_slow.walk_component.path_lookupat
> 414419 ± 86% +7661.1% 32163484 ±172% latency_stats.max.max
> 257867 ±166% +12362.7% 32137192 ±172% latency_stats.max.nfs_wait_on_request.nfs_writepage_setup.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
> 81137 ±155% -94.7% 4265 ± 45% latency_stats.max.pipe_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
> 90183 ±153% -55.4% 40178 ± 84% latency_stats.max.proc_cgroup_show.proc_single_show.seq_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
> 839.25 ± 21% +1114.0% 10188 ±125% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_execute.rpc_run_task.nfs4_call_sync_sequence.[nfsv4]._nfs4_proc_lookup.[nfsv4].nfs4_proc_lookup_common.[nfsv4].nfs4_proc_lookup.[nfsv4].nfs_lookup.lookup_slow.walk_component.path_lookupat
> 6232 ± 87% -100.0% 0.00 ± -1% latency_stats.sum.call_rwsem_down_read_failed.__do_page_fault.trace_do_page_fault.do_async_page_fault.async_page_fault.skb_copy_datagram_iter.tcp_rcv_established.tcp_v4_do_rcv.tcp_prequeue_process.tcp_recvmsg.inet_recvmsg.sock_recvmsg
> 1107 ± 62% +964.9% 11788 ± 55% latency_stats.sum.call_rwsem_down_read_failed.proc_pid_cmdline_read.__vfs_read.vfs_read.SyS_read.do_syscall_64.return_from_SYSCALL_64
> 276091 ±166% +11542.5% 32143926 ±172% latency_stats.sum.nfs_wait_on_request.nfs_writepage_setup.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
> 97392 ±128% -82.9% 16660 ± 29% latency_stats.sum.pipe_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
> 90183 ±153% -55.4% 40178 ± 84% latency_stats.sum.proc_cgroup_show.proc_single_show.seq_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
> 839.25 ± 21% +1114.0% 10188 ±125% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_execute.rpc_run_task.nfs4_call_sync_sequence.[nfsv4]._nfs4_proc_lookup.[nfsv4].nfs4_proc_lookup_common.[nfsv4].nfs4_proc_lookup.[nfsv4].nfs_lookup.lookup_slow.walk_component.path_lookupat
> 3561 ± 11% +12.9% 4020 ± 6% sched_debug.cpu.curr->pid.avg
> 9783 ± 6% +11.8% 10942 ± 3% sched_debug.cpu.curr->pid.max
> 4222 ± 7% +11.3% 4701 ± 3% sched_debug.cpu.curr->pid.stddev
> 3988122 ± 13% +24.4% 4962088 ± 4% sched_debug.cpu.max_idle_balance_cost.max
> 1330791 ± 13% +30.8% 1740923 ± 4% sched_debug.cpu.max_idle_balance_cost.stddev
> 576379 ± 1% +9.7% 632485 ± 3% sched_debug.cpu.nr_load_updates.max
> 987546 ± 1% +20.0% 1184637 ± 1% sched_debug.cpu.nr_switches.avg
> 1074939 ± 2% +23.9% 1332366 ± 3% sched_debug.cpu.nr_switches.max
> 67934 ± 31% +110.9% 143278 ± 5% sched_debug.cpu.nr_switches.stddev
> -15.31 ±-32% -41.6% -8.93 ±-57% sched_debug.cpu.nr_uninterruptible.min
> 990916 ± 1% +20.2% 1190867 ± 1% sched_debug.cpu.sched_count.avg
> 1079929 ± 2% +24.2% 1341786 ± 3% sched_debug.cpu.sched_count.max
> 69632 ± 31% +111.3% 147114 ± 5% sched_debug.cpu.sched_count.stddev
> 386683 ± 1% +25.2% 483992 ± 2% sched_debug.cpu.sched_goidle.avg
> 424674 ± 1% +29.4% 549482 ± 1% sched_debug.cpu.sched_goidle.max
> 33344 ± 23% +107.5% 69183 ± 9% sched_debug.cpu.sched_goidle.stddev
> 513012 ± 1% +19.1% 610887 ± 1% sched_debug.cpu.ttwu_count.avg
> 570351 ± 5% +24.2% 708599 ± 5% sched_debug.cpu.ttwu_count.max
> 40059 ± 42% +112.3% 85064 ± 14% sched_debug.cpu.ttwu_count.stddev
>
>
>
>
> Thanks,
> Xiaolong

Thanks for the report.

I could not reproduce your results, assuming this is netperf over loopback.

The problem with synthetic tests like this is that a small change can
cascade into secondary effects.

In your stats I can see a high number of context switches that a
normal netperf run does not exhibit.

Also, maybe the problem is that you are running in a VM; perhaps some
NUMA pinning issue or something similar...


Here is a small hack to enable/disable my patch without having to
reboot, and thus with less variability:

edumazet@manihi:/ssd/git/net-next$ git diff net/ipv4/tcp.c
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3251fe71f39f..55afe7b16016 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1035,7 +1035,7 @@ EXPORT_SYMBOL(tcp_sendpage);
  */
 static int linear_payload_sz(bool first_skb)
 {
-	if (first_skb)
+	if (first_skb || (netdev_max_backlog & 1))
 		return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
 	return 0;
 }

Then :

lpk51:~# echo 1000 >/proc/sys/net/core/netdev_max_backlog
lpk51:~# perf stat -a -C0,1 ./netperf -H 127.0.0.1 -Cc -T0,1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 () port 0 AF_INET : cpu bind
port 0 protocol 6
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB

87380 16384 16384 10.00 45854.74 5.90 5.90 0.337 0.337

Performance counter stats for 'system wide':

      20005.989547 task-clock (msec)        # 2.000 CPUs utilized
         1,568,042 context-switches         # 0.078 M/sec
                 2 cpu-migrations           # 0.000 K/sec
               121 page-faults              # 0.006 K/sec
    69,260,921,315 cycles                   # 3.462 GHz                    [83.33%]
    49,197,585,858 stalled-cycles-frontend  # 71.03% frontend cycles idle  [83.34%]
    30,069,741,080 stalled-cycles-backend   # 43.42% backend cycles idle   [66.67%]
    32,341,104,986 instructions             # 0.47 insns per cycle
                                            # 1.52 stalled cycles per insn [83.34%]
     5,887,445,456 branches                 # 294.284 M/sec                [83.34%]
        65,533,193 branch-misses            # 1.11% of all branches        [83.33%]

10.003057093 seconds time elapsed

lpk51:~# echo 1001 >/proc/sys/net/core/netdev_max_backlog
lpk51:~# perf stat -a -C0,1 ./netperf -H 127.0.0.1 -Cc -T0,1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 () port 0 AF_INET : cpu bind
port 0 protocol 6
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB

87380 16384 16384 10.00 45267.63 5.94 5.94 0.344 0.344

Performance counter stats for 'system wide':

      20006.713062 task-clock (msec)        # 2.000 CPUs utilized
         1,466,531 context-switches         # 0.073 M/sec
                 2 cpu-migrations           # 0.000 K/sec
               120 page-faults              # 0.006 K/sec
    69,743,911,198 cycles                   # 3.486 GHz                    [83.33%]
    49,509,413,005 stalled-cycles-frontend  # 70.99% frontend cycles idle  [83.34%]
    28,186,805,983 stalled-cycles-backend   # 40.41% backend cycles idle   [66.67%]
    31,662,971,822 instructions             # 0.45 insns per cycle
                                            # 1.56 stalled cycles per insn [83.34%]
     5,751,313,010 branches                 # 287.469 M/sec                [83.34%]
        69,213,717 branch-misses            # 1.20% of all branches        [83.33%]

10.002924757 seconds time elapsed