Performance differences in recent kernels

From: rwhron@earthlink.net
Date: Tue Sep 10 2002 - 22:54:00 EST


Just to note a few differences in recent benchmarks on quad xeon
with 3.75 gb ram and qlogic 2200 -> raid 5 array.

For AIM7, the key metrics are jobs/min (higher is better)
and cpu time (in seconds). The tasks column is equivalent to
load average.

AIM7 database workload

Andrea's tree has the v6.0 qlogic driver which helps i/o a lot.
It's the only tree with that driver atm. The other trees look
pretty similar at load averages of 32 and 256.

kernel Tasks Jobs/Min Real CPU
2.4.19-rc5-aa1 32 555.4 342.2 146.1
2.4.20-pre5 32 470.7 403.8 147.2
2.4.20-pre4-ac1 32 472.0 402.7 142.4
2.5.33-mm5 32 474.4 400.7 144.2

2.4.19-rc5-aa1 256 905.2 1679.9 931.9
2.4.20-pre5 256 769.1 1977.0 1048.5
2.4.20-pre4-ac1 256 766.4 1984.2 945.5
2.5.33-mm5 256 763.0 1992.9 1020.8
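
To put a number on the gap described above, here is a minimal sketch that expresses each tree's jobs/min at 32 tasks as a percent difference from 2.4.19-rc5-aa1. The figures are copied from the table; the dictionary and function names are illustrative only and are not part of AIM7.

```python
# Jobs/min at 32 tasks, copied from the AIM7 database table above.
jobs_per_min = {
    "2.4.19-rc5-aa1": 555.4,
    "2.4.20-pre5": 470.7,
    "2.4.20-pre4-ac1": 472.0,
    "2.5.33-mm5": 474.4,
}

def percent_vs_baseline(results, baseline):
    """Return each kernel's jobs/min as a percent difference from baseline."""
    base = results[baseline]
    return {k: (v - base) / base * 100.0 for k, v in results.items()}

deltas = percent_vs_baseline(jobs_per_min, "2.4.19-rc5-aa1")
for kernel, pct in deltas.items():
    print(f"{kernel:18s} {pct:+6.1f}%")
```

The three trees without the v6.0 qlogic driver all land roughly 15% below -aa at this load.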

AIM7 file server workload

Interesting here to note that with low load averages,
2.5.33-mm5 is on top, but as load average increases, -aa is
ahead.

kernel Tasks Jobs/Min Real CPU
2.4.19-rc5-aa1 4 131.6 184.2 45.5
2.4.20-pre5 4 132.7 182.7 44.1
2.4.20-pre4-ac1 4 132.7 182.6 46.0
2.5.33-mm5 4 140.4 172.6 37.7

2.4.19-rc5-aa1 32 264.8 732.3 219.1
2.4.20-pre5 32 230.5 841.5 265.7
2.4.20-pre4-ac1 32 227.7 851.6 257.6
2.5.33-mm5 32 229.8 843.7 224.7

AIM7 shared multiuser workload

This is more cpu intensive than the other aim7 workloads.
2.5.33-mm5 is using a lot more cpu time. That may be a bug in
the workload. I'm investigating that.

kernel Tasks Jobs/Min Real CPU
2.4.19-rc5-aa1 64 2319.6 160.6 163.8
2.4.20-pre4-ac1 64 1960.4 190.0 164.8
2.4.20-pre5 64 1980.3 188.1 185.1
2.5.33-mm5 64 1461.2 254.9 566.2

2.4.19-rc5-aa1 256 2835.5 525.5 652.6
2.4.20-pre4-ac1 256 2444.2 609.6 656.6
2.4.20-pre5 256 2432.8 612.4 701.0
2.5.33-mm5 256 1890.5 788.1 2316.4

IRMAN - interactive response measurement.
2.5.33-mm5 has much lower max response time for file io.
The standard deviation is very low too (which is good).

                   FILE_IO Response time measurements (milliseconds)
kernel                 Max       Min       Avg    StdDev
2.4.20-pre4-ac1     40.603     0.008     0.009     0.043
2.4.20-pre5         52.405     0.009     0.011     0.080
2.5.33-mm5           2.955     0.008     0.010     0.004

autoconf-2.53 build (12 times) creates about 1.2 million processes.
It's a good fork test. rmap slows this one down. There is a healthy
difference between the rmap in 2.5.33-mm5 and 2.4.20-pre4-ac1.

kernel seconds (smaller is better)
2.4.20-pre4-ac1 856.4
2.4.19-rc5-aa1 727.2
2.4.20-pre5 718.4
2.5.33 799.2
2.5.33-mm5 782.0

Time to build the kernel 12 times. Not a lot of difference here.

kernel seconds
2.4.19-rc5-aa1 718.8
2.4.20-pre4-ac1 735.8
2.4.20-pre5 728.1
2.5.33 728.2
2.5.33-mm5 736.8

The Open Source database benchmark doesn't vary much between trees.

dbench on various filesystems. This isn't meant to compare
filesystems, because the disk geometry is different for each fs.

rmap has generally not done well on dbench when the process
count is high, but 2.5.33* on ext2 and ext3 really smokes at
64 processes.

dbench ext2 64 processes Average (5 runs)
2.4.19-rc5-aa1 179.61 MB/second
2.4.20-pre4-ac1 140.63
2.4.20-pre5 145.00
2.5.33 220.54
2.5.33-mm5 214.78

dbench ext2 192 processes Average
2.4.19-rc5-aa1 155.44
2.4.20-pre4-ac1 79.16
2.4.20-pre5 115.31
2.5.33 134.27
2.5.33-mm5 174.17
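
One way to see how each tree holds up as the process count rises is to divide the 192-process average by the 64-process average. A rough sketch using the two ext2 tables above; the names are illustrative and not part of dbench.

```python
# dbench ext2 averages in MB/second, copied from the two tables above.
ext2_64 = {
    "2.4.19-rc5-aa1": 179.61,
    "2.4.20-pre4-ac1": 140.63,
    "2.4.20-pre5": 145.00,
    "2.5.33": 220.54,
    "2.5.33-mm5": 214.78,
}
ext2_192 = {
    "2.4.19-rc5-aa1": 155.44,
    "2.4.20-pre4-ac1": 79.16,
    "2.4.20-pre5": 115.31,
    "2.5.33": 134.27,
    "2.5.33-mm5": 174.17,
}

def retained_fraction(low_load, high_load):
    """Fraction of low-load throughput that survives at high load."""
    return {k: high_load[k] / low_load[k] for k in low_load if k in high_load}

retained = retained_fraction(ext2_64, ext2_192)
for kernel, frac in sorted(retained.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{kernel:18s} keeps {frac:.0%} of its 64-process throughput")
```

This only compares each kernel against itself, so it says nothing about absolute throughput.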

dbench ext3 64 processes Average
2.4.19-rc5-aa1 97.69
2.4.20-pre4-ac1 59.42
2.4.20-pre5 80.79
2.5.33-mm5 112.20

dbench ext3 192 processes Average
2.4.19-rc5-aa1 77.06
2.4.20-pre4-ac1 28.48
2.4.20-pre5 58.66
2.5.33-mm5 72.92

dbench reiserfs 64 processes Average
2.4.19-rc5-aa1 70.50
2.4.20-pre4-ac1 57.30
2.4.20-pre5 62.60
2.5.33-mm5 77.22

dbench reiserfs 192 processes Average
2.4.19-rc5-aa1 55.37
2.4.20-pre4-ac1 20.56
2.4.20-pre5 44.14
2.5.33-mm5 49.61

The O(1) scheduler helps tbench a lot when the process
count is high. The ac tree may not have the latest
scheduler updates.

tbench 192 processes Average
2.4.19-rc5-aa1 116.76
2.4.20-pre4-ac1 100.30
2.4.20-pre5 27.98
2.5.33 115.93
2.5.33-mm5 117.91

LMbench latency running /bin/sh had a big regression in the
-mm tree recently.

                       fork   execve  /bin/sh
kernel              process  process  process
------------------  -------  -------  -------
2.4.19-rc5-aa1        186.8    883.1   3937.9
2.4.20-pre4-ac1       227.9    904.5   3866.0
2.4.20-pre5           310.0    990.9   4178.1
2.5.33-mm5            244.3    949.0  71588.2
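
To size the /bin/sh regression, the sketch below divides the 2.5.33-mm5 figure by the average of the three 2.4 trees. The values come from the table above; the helper name is made up for this illustration.

```python
from statistics import mean

# LMbench /bin/sh process latency, copied from the table above.
sh_latency = {
    "2.4.19-rc5-aa1": 3937.9,
    "2.4.20-pre4-ac1": 3866.0,
    "2.4.20-pre5": 4178.1,
    "2.5.33-mm5": 71588.2,
}

def slowdown_vs_others(results, suspect):
    """Ratio of the suspect kernel's latency to the mean of all the others."""
    others = [v for k, v in results.items() if k != suspect]
    return results[suspect] / mean(others)

factor = slowdown_vs_others(sh_latency, "2.5.33-mm5")
print(f"2.5.33-mm5 /bin/sh latency is about {factor:.1f}x the 2.4 average")
```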

Context switching with 32K - times in microseconds - smaller is better
----------------------------------------------------------------------
                  32prc/32k  64prc/32k  96prc/32k
kernel            ctx swtch  ctx swtch  ctx swtch
----------------  ---------  ---------  ---------
2.4.19-rc5-aa1       35.411     65.120     64.686
2.4.20-pre4-ac1      30.642     49.307     56.068
2.4.20-pre5          17.716     27.205     43.716
2.5.33-mm5           21.786     49.555     63.000

Context switching with 64K - times in microseconds - smaller is better
----------------------------------------------------------------------
                  16prc/64k  32prc/64k  64prc/64k
kernel            ctx swtch  ctx swtch  ctx swtch
----------------  ---------  ---------  ---------
2.4.19-rc5-aa1       50.523    111.320    137.383
2.4.20-pre4-ac1      50.691     92.204    122.261
2.4.20-pre5          36.763     44.498    111.952
2.5.33-mm5           27.113     42.679    124.907

File create/delete and VM system latencies in microseconds - smaller is better
----------------------------------------------------------------------------
The -aa tree has higher latency for file creation. File delete latency is
similar for all trees. 2.4.20-pre5 has the lowest mmap latency, 2.5.33-mm5
the highest.

                       0K       1K      10K      10K     Mmap    Page
kernel             Create   Create   Create   Delete  Latency   Fault
----------------  -------  -------  -------  -------  -------  ------
2.4.19-rc5-aa1     126.57   174.70   256.64    62.50   3728.2    4.00
2.4.20-pre4-ac1     86.92   137.28   217.73    61.22   3557.2    3.00
2.4.20-pre5         90.24   140.22   219.17    61.38   2673.8    3.00
2.5.33-mm5          93.43   143.58   225.19    63.83   4634.7    4.00

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
2.5.33-mm5 has significantly lower latency here, except for tcp connection.

kernel Pipe AF/Unix UDP TCP RPC/TCP TCPconn
----------------- ------- ------- ------- ------- ------- -------
2.4.19-rc5-aa1 36.697 48.436 55.3271 50.8352 80.8498 88.330
2.4.20-pre4-ac1 34.110 56.582 53.9643 54.7447 84.4660 86.195
2.4.20-pre5 10.819 25.379 38.4917 45.2661 79.1166 86.745
2.5.33-mm5 8.337 14.122 23.6442 35.4457 77.0814 111.252

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
                                            
kernel Pipe AF/Unix TCP
----------------- ------- ------- -------
2.4.19-rc5-aa1 541.56 253.43 166.08
2.4.20-pre4-ac1 552.99 240.54 168.34
2.4.20-pre5 462.82 273.55 161.28
2.5.33-mm5 515.64 543.57 171.01

tiobench-0.3.3 creates 12 gigabytes worth of files.

Unit information
================
Rate = megabytes per second
CPU% = percentage of CPU used during the test
Latency = milliseconds
Lat% = percent of requests that took longer than 10 seconds
CPU Eff = Rate divided by CPU% - throughput per cpu load
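
As a quick check on the CPU Eff definition, the sketch below recomputes it for a few rows from the ext2 tables in this report. It assumes CPU% is taken as a fraction (28.87% becomes 0.2887), which is my reading and not something tiobench documents here; it does reproduce the reported values after rounding. The function name is illustrative.

```python
def cpu_efficiency(rate_mb_s, cpu_percent):
    """Throughput per unit of CPU load: Rate / (CPU% / 100)."""
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    return rate_mb_s / (cpu_percent / 100.0)

# (kernel, threads, rate in MB/s, CPU%, CPU Eff as reported)
rows = [
    ("2.4.19-rc5-aa1", 1, 51.21, 28.87, 177),   # sequential reads ext2
    ("2.5.33", 128, 12.17, 222.9, 5),           # sequential writes ext2
    ("2.5.33-mm5", 128, 30.78, 30.03, 102),     # sequential writes ext2
]

for kernel, threads, rate, cpu, reported in rows:
    computed = cpu_efficiency(rate, cpu)
    print(f"{kernel:16s} thr={threads:<4d} computed={computed:7.1f} reported={reported}")
```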

Sequential Reads ext2
2.5.33-mm5 has much lower max latency when the thread count is high for
sequential reads. The qlogic driver in -aa helps a lot here too.

                    Num                        Avg      Maximum      Lat%   CPU
Kernel              Thr    Rate   (CPU%)   Latency      Latency      >10s   Eff
------------------  ---  ------------------------------------------------------
2.4.19-rc5-aa1        1   51.21   28.87%     0.226       103.26   0.00000   177
2.4.20-pre4-ac1       1   34.14   17.25%     0.341       851.34   0.00000   198
2.4.20-pre5           1   33.68   20.36%     0.345       110.11   0.00000   165
2.5.33                1   25.36   13.67%     0.460      1512.99   0.00000   185
2.5.33-mm5            1   31.73   14.80%     0.367       853.99   0.00000   214

2.4.19-rc5-aa1      256   40.68   25.39%    64.084    107977.97   0.36264   160
2.4.20-pre4-ac1     256   34.51   19.63%    51.031    845159.88   0.02919   176
2.4.20-pre5         256   31.89   22.95%    57.236    849792.70   0.03459   139
2.5.33              256   24.54   14.46%    94.422    449274.89   0.09794   170
2.5.33-mm5          256   22.39   18.56%   104.515     24623.21   0.00000   121

Sequential Writes ext2
There is a dramatic reduction in cpu utilization in 2.5.33-mm5 and increase in
throughput compared to 2.5.33 when thread count is high.

                    Num                        Avg      Maximum      Lat%   CPU
Kernel              Thr    Rate   (CPU%)   Latency      Latency      >10s   Eff
------------------  ---  ------------------------------------------------------
2.4.19-rc5-aa1      128   37.40   45.99%    32.405     46333.30   0.00105    81
2.4.20-pre4-ac1     128   34.01   36.94%    40.121     47331.57   0.00058    92
2.4.20-pre5         128   32.98   49.33%    39.692     52093.19   0.01446    67
2.5.33              128   12.17   222.9%   108.966    910455.61   0.19503     5
2.5.33-mm5          128   30.78   30.03%    32.973    909931.81   0.07858   102

Sequential Reads ext3
2.5.33-mm5 has a more graceful degradation in throughput on ext3.
Fairness is better too.

                    Num                        Avg      Maximum      Lat%   CPU
Kernel              Thr    Rate   (CPU%)   Latency      Latency      >10s   Eff
------------------  ---  ------------------------------------------------------
2.4.19-rc5-aa1        1   51.13   29.59%     0.227       460.92   0.00000   173
2.4.20-pre4-ac1       1   34.12   17.37%     0.341      1019.65   0.00000   196
2.4.20-pre5           1   33.28   20.62%     0.350       137.44   0.00000   161
2.5.33-mm5            1   31.70   14.75%     0.367       581.89   0.00000   215

2.4.19-rc5-aa1       64    7.38    4.51%    98.947     20638.56   0.00000   164
2.4.20-pre4-ac1      64    6.55    3.94%   110.432     14937.49   0.00000   166
2.4.20-pre5          64    6.34    4.16%   111.299     14234.83   0.00000   152
2.5.33-mm5           64   12.29    8.51%    55.372      8799.99   0.00000   144

Sequential Writes ext3
Here 2.5.33-mm5 is great with 1 thread, but takes a hit at 32 threads.
Latency is pretty high too. Cpu utilization is quite low though.

                    Num                        Avg      Maximum      Lat%   CPU
Kernel              Thr    Rate   (CPU%)   Latency      Latency      >10s   Eff
------------------  ---  ------------------------------------------------------
2.4.19-rc5-aa1        1   44.23   53.01%     0.243      6084.88   0.00000    83
2.4.20-pre4-ac1       1   37.86   50.66%     0.300      4288.99   0.00000    75
2.4.20-pre5           1   37.58   55.38%     0.295     14659.06   0.00003    68
2.5.33-mm5            1   54.16   65.87%     0.211      5605.87   0.00000    82

2.4.19-rc5-aa1       32   20.86   121.6%     8.861     13693.99   0.00000    17
2.4.20-pre4-ac1      32   28.33   156.6%    10.041     15724.46   0.00000    18
2.4.20-pre5          32   22.36   114.3%    10.382     12867.96   0.00000    20
2.5.33-mm5           32    5.90   11.67%    52.386   1150696.62   0.08252    50

Sequential Reads on reiserfs
Don't know what happened to the 2.5 numbers here.
-aa has much higher throughput at high thread count,
but I believe that's a reiserfs change that is fixed in 2.4.20-pre6.

                    Num                        Avg      Maximum      Lat%   CPU
Kernel              Thr    Rate   (CPU%)   Latency      Latency      >10s   Eff
------------------  ---  ------------------------------------------------------
2.4.19-rc5-aa1        1   48.21   30.97%     0.241       104.82   0.00000   156
2.4.20-pre4-ac1       1   33.65   19.27%     0.346       136.95   0.00000   175
2.4.20-pre5           1   35.25   23.00%     0.330       492.30   0.00000   153

2.4.19-rc5-aa1       32   36.27   25.59%     9.946     12613.17   0.00000   142
2.4.20-pre4-ac1      32    7.08    4.73%    51.894      5808.95   0.00000   149
2.4.20-pre5          32    6.74    5.16%    53.395      8148.47   0.00000   131

Sequential Writes reiserfs - max latency is very high for everyone here.

                    Num                        Avg      Maximum      Lat%   CPU
Kernel              Thr    Rate   (CPU%)   Latency      Latency      >10s   Eff
------------------  ---  ------------------------------------------------------
2.4.19-rc5-aa1      256   31.90   121.9%    67.227    166079.82   0.28051    26
2.4.20-pre4-ac1     256   23.83   128.1%    84.309    135202.89   0.27039    19
2.4.20-pre5         256   18.23   88.00%    76.265    258230.65   0.26893    21

More details and more kernel tests at:
http://home.earthlink.net/~rwhron/kernel/bigbox.html

-- 
Randy Hron
