Re: x86 memcpy performance

From: Borislav Petkov
Date: Sun Aug 14 2011 - 06:04:05 EST


On Fri, Aug 12, 2011 at 09:52:20PM +0200, Ingo Molnar wrote:
> Sounds very interesting - it would be nice to see 'perf record' +
> 'perf report' profiles done on that workload, before and after your
> patches.

FWIW, I've been playing with an SSE memcpy version for the kernel
recently too; here's what I have so far:

First of all, I did a trace of all the memcpy buffer sizes used while
building a kernel, see attached kernel_build.sizes.

On the one hand, there is a large number of small copies (1.1M of the
1.2M calls total); on the other, a relatively small number of larger
copies (256 - 2048 bytes), about 100K calls in total, which nevertheless
account for the bulk of the data copied: 138MB of 175MB total. So, if
the buffer being copied is big enough, the FPU context save/restore cost
might be something we're willing to pay.

I first implemented the SSE memcpy in userspace to measure the speedup
vs. the memcpy_64 we have right now:

Benchmarking with 10000 iterations, average results (XM is the SSE/XMM
version, MM the current memcpy; speedup = MM/XM, i.e. >1 means the SSE
version is faster):
size XM MM speedup
119 540.58 449.491 0.8314969419
189 296.318 263.507 0.8892692985
206 297.949 271.399 0.9108923485
224 255.565 235.38 0.9210161798
221 299.383 276.628 0.9239941159
245 299.806 279.432 0.9320430545
369 314.774 316.89 1.006721324
425 327.536 330.475 1.00897153
439 330.847 334.532 1.01113687
458 333.159 340.124 1.020904708
503 334.44 352.166 1.053003229
767 375.612 429.949 1.144661625
870 358.888 312.572 0.8709465025
882 394.297 454.977 1.153893229
925 403.82 472.56 1.170222413
1009 407.147 490.171 1.203915735
1525 512.059 660.133 1.289174911
1737 556.85 725.552 1.302958536
1778 533.839 711.59 1.332965994
1864 558.06 745.317 1.335549882
2039 585.915 813.806 1.388949687
3068 766.462 1105.56 1.442422252
3471 883.983 1239.99 1.40272883
3570 895.822 1266.74 1.414057295
3748 906.832 1302.4 1.436212771
4086 957.649 1486.93 1.552686041
6130 1238.45 1996.42 1.612023046
6961 1413.11 2201.55 1.557939181
7162 1385.5 2216.49 1.59977178
7499 1440.87 2330.12 1.617158856
8182 1610.74 2720.45 1.688950194
12273 2307.86 4042.88 1.751787902
13924 2431.8 4224.48 1.737184756
14335 2469.4 4218.82 1.708440514
15018 2675.67 1904.07 0.711622886
16374 2989.75 5296.26 1.771470902
24564 4262.15 7696.86 1.805863077
27852 4362.53 3347.72 0.7673805572
28672 5122.8 7113.14 1.388524413
30033 4874.62 8740.04 1.792967931
32768 6014.78 7564.2 1.257603505
49142 14464.2 21114.2 1.459757233
55702 16055 23496.8 1.463523623
57339 16725.7 24553.8 1.46803388
60073 17451.5 24407.3 1.398579162


Each size was measured with randomly generated source/destination
misalignment to exercise both paths of the implementation.
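
For reference, here is a minimal sketch of the kind of userspace harness
I mean -- a reconstruction for illustration only, not the actual test
program. sse_memcpy() below is just a stand-in for the XMM-based routine
under test, and the TSC-based timing is only one way to do it:

/*
 * Minimal userspace timing harness sketch with random misalignment.
 * Reconstruction for illustration only; sse_memcpy() is a stand-in for
 * the real XMM-based routine.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>		/* __rdtsc() */

#define ITERATIONS	10000
#define MAX_OFFSET	64	/* max random misalignment per buffer */

/* Stand-in so the sketch compiles; replace with the real SSE version. */
static void *sse_memcpy(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);
}

static uint64_t time_one(void *(*copy)(void *, const void *, size_t),
			 void *dst, const void *src, size_t len)
{
	uint64_t start = __rdtsc();

	copy(dst, src, len);

	return __rdtsc() - start;
}

int main(void)
{
	size_t size = 4096;
	char *src = malloc(size + MAX_OFFSET);
	char *dst = malloc(size + MAX_OFFSET);
	uint64_t xm = 0, mm = 0;
	int i;

	memset(src, 0x5a, size + MAX_OFFSET);

	for (i = 0; i < ITERATIONS; i++) {
		/* randomly misalign both source and destination */
		size_t soff = rand() % MAX_OFFSET;
		size_t doff = rand() % MAX_OFFSET;

		xm += time_one(sse_memcpy, dst + doff, src + soff, size);
		mm += time_one(memcpy, dst + doff, src + soff, size);
	}

	/* size, avg XM, avg MM, speedup = MM/XM */
	printf("%zu %.3f %.3f %.10g\n", size,
	       (double)xm / ITERATIONS, (double)mm / ITERATIONS,
	       (double)mm / (double)xm);

	free(src);
	free(dst);

	return 0;
}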

I've implemented the SSE memcpy in the kernel similarly to
arch/x86/lib/mmx_32.c and did some kernel build measurements:

with SSE memcpy
===============

Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):

3301761.517649 task-clock # 24.001 CPUs utilized ( +- 1.48% )
520,658 context-switches # 0.000 M/sec ( +- 0.25% )
63,845 CPU-migrations # 0.000 M/sec ( +- 0.58% )
26,070,835 page-faults # 0.008 M/sec ( +- 0.00% )
1,812,482,599,021 cycles # 0.549 GHz ( +- 0.85% ) [64.55%]
551,783,051,492 stalled-cycles-frontend # 30.44% frontend cycles idle ( +- 0.98% ) [65.64%]
444,996,901,060 stalled-cycles-backend # 24.55% backend cycles idle ( +- 1.15% ) [67.16%]
1,488,917,931,766 instructions # 0.82 insns per cycle
# 0.37 stalled cycles per insn ( +- 0.91% ) [69.25%]
340,575,978,517 branches # 103.150 M/sec ( +- 0.99% ) [68.29%]
21,519,667,206 branch-misses # 6.32% of all branches ( +- 1.09% ) [65.11%]

137.567155255 seconds time elapsed ( +- 1.48% )


plain 3.0
=========

Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):

3504754.425527 task-clock # 24.001 CPUs utilized ( +- 1.31% )
518,139 context-switches # 0.000 M/sec ( +- 0.32% )
61,790 CPU-migrations # 0.000 M/sec ( +- 0.73% )
26,056,947 page-faults # 0.007 M/sec ( +- 0.00% )
1,826,757,751,616 cycles # 0.521 GHz ( +- 0.66% ) [63.86%]
557,800,617,954 stalled-cycles-frontend # 30.54% frontend cycles idle ( +- 0.79% ) [64.65%]
443,950,768,357 stalled-cycles-backend # 24.30% backend cycles idle ( +- 0.60% ) [67.07%]
1,469,707,613,500 instructions # 0.80 insns per cycle
# 0.38 stalled cycles per insn ( +- 0.68% ) [69.98%]
335,560,565,070 branches # 95.744 M/sec ( +- 0.67% ) [69.09%]
21,365,279,176 branch-misses # 6.37% of all branches ( +- 0.65% ) [65.36%]

146.025263276 seconds time elapsed ( +- 1.31% )


So, although a kernel build is probably not the proper workload for an
SSE memcpy routine, I'm seeing about 8.5 seconds of build time
improvement, i.e. something around 6%. We're executing slightly more
instructions (~1,489G vs ~1,470G, about 1.3% more), but the amount of
data moved per instruction is higher due to the 16-byte XMM moves.

Here's the SSE memcpy version I have so far. I haven't wired in the
proper CPU feature detection yet because we want to run more benchmarks
like netperf first to see whether we get any positive results there.

The SYSTEM_RUNNING check is there to take care of early boot situations
where we can't handle FPU exceptions yet but still use memcpy. There are
aligned and misaligned variants which should handle any buffers and
sizes, although I've set the minimum buffer size for the SSE path to 512
bytes so that the context save/restore cost is amortized somewhat.
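
To make the dispatch described above concrete before the actual patch,
here is a rough sketch of its shape -- not the patch itself.
SSE_MEMCPY_MIN_SIZE, sse_memcpy(), _sse_memcpy_aligned() and
_sse_memcpy_unaligned() are placeholder names, the copy loops are done
in the spirit of arch/x86/lib/mmx_32.c, and a real version would also
need to check whether the FPU is usable in the current context:

/*
 * Rough sketch of the memcpy dispatch described above -- not the actual
 * patch. All the sse_* names are placeholders.
 */
#include <linux/kernel.h>
#include <linux/string.h>
#include <asm/i387.h>		/* kernel_fpu_begin/end() */

#define SSE_MEMCPY_MIN_SIZE	512

/*
 * 16-byte XMM copy loop for mutually aligned buffers, mmx_32.c-style.
 * The kernel is built with -mno-sse, so the compiler won't touch %xmm0
 * itself between kernel_fpu_begin() and kernel_fpu_end().
 */
static void _sse_memcpy_aligned(void *to, const void *from, size_t len)
{
	while (len >= 16) {
		asm volatile("movaps (%0), %%xmm0\n\t"
			     "movaps %%xmm0, (%1)"
			     : : "r" (from), "r" (to) : "memory");
		from += 16;
		to   += 16;
		len  -= 16;
	}

	if (len)
		__memcpy(to, from, len);	/* copy the tail */
}

/* Same loop, but with unaligned loads/stores. */
static void _sse_memcpy_unaligned(void *to, const void *from, size_t len)
{
	while (len >= 16) {
		asm volatile("movups (%0), %%xmm0\n\t"
			     "movups %%xmm0, (%1)"
			     : : "r" (from), "r" (to) : "memory");
		from += 16;
		to   += 16;
		len  -= 16;
	}

	if (len)
		__memcpy(to, from, len);
}

void *sse_memcpy(void *to, const void *from, size_t len)
{
	/*
	 * Small copies and early boot (where we can't handle FPU
	 * exceptions yet) fall back to the existing memcpy_64.
	 */
	if (len < SSE_MEMCPY_MIN_SIZE || system_state != SYSTEM_RUNNING)
		return __memcpy(to, from, len);

	kernel_fpu_begin();

	if ((((unsigned long)to | (unsigned long)from) & 15) == 0)
		_sse_memcpy_aligned(to, from, len);
	else
		_sse_memcpy_unaligned(to, from, len);

	kernel_fpu_end();

	return to;
}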

Comments are much appreciated! :-)

--