Re: x86 memcpy performance

From: Borislav Petkov
Date: Fri Sep 09 2011 - 09:43:51 EST


On Fri, Sep 09, 2011 at 01:23:09PM +0200, Maarten Lankhorst wrote:
> This specific one happened far more than any of the other memcpy usages, and
> ignoring the check when destination is page aligned, most of them are gone.
>
> In short: I don't think I can get a speedup by using avx memcpy in-kernel.
>
> YMMV, if it does speed up for you, I'd love to see concrete numbers. And not only worst
> case, but for the common aligned cases too. Or some concrete numbers that misaligned
> happens a lot for you.

Actually, assuming alignment matters, I'd need to redo the trace_printk run
I did initially on buffer sizes:

http://marc.info/?l=linux-kernel&m=131331602309340 (kernel_build.sizes attached)

to get a more sensible grasp on the alignment of kernel buffers along
with their sizes and to see whether we're doing a lot of unaligned large
buffer copies in the kernel. I seriously doubt that, though; we should
be doing everything pagewise anyway, so...
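
For what it's worth, here is a rough sketch of how such a trace could be
post-processed once it records addresses as well as sizes. Everything in it
is an assumption for illustration: the (dst, src, len) record format, the
4 KiB page size, the 32-byte vector alignment (the width of an AVX
register), and the cutoff of one page for calling a copy "large". The
records themselves are made up, not taken from any real trace.

```python
from collections import Counter

PAGE_SIZE = 4096   # assumption: 4 KiB pages, as on x86
VEC_ALIGN = 32     # assumption: 32 bytes, the width of an AVX register


def classify(dst, src, length):
    """Return (size bucket, alignment class) for one hypothetical copy record."""
    if dst % PAGE_SIZE == 0 and src % PAGE_SIZE == 0:
        align = "page"
    elif dst % VEC_ALIGN == 0 and src % VEC_ALIGN == 0:
        align = "vec"
    else:
        align = "unaligned"
    # Assumption: "large" means at least one page.
    bucket = "large" if length >= PAGE_SIZE else "small"
    return bucket, align


def summarize(records):
    """Count how many copies fall into each (bucket, alignment) class."""
    return Counter(classify(d, s, n) for d, s, n in records)


# Made-up (dst, src, len) records, purely for illustration.
records = [
    (0xffff880000001000, 0xffff880000002000, 4096),
    (0xffff88000000100c, 0xffff88000000202c, 8192),
    (0xffff880000003020, 0xffff880000004040, 144),
]

for key, count in sorted(summarize(records).items()):
    print(key, count)
```

If the ("large", "unaligned") class stays close to empty on a real
workload, that would back up the doubt above.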

Concerning numbers, I ran your version again and sorted the output by
speedup. The highest scores are:

30037(12/44) 5566.4 12797.2 2.299011642
28672(12/44) 5512.97 12588.7 2.283467991
30037(28/60) 5610.34 12732.7 2.269502799
27852(12/44) 5398.36 12242.4 2.267803859
30037(4/36) 5585.02 12598.6 2.25578257
28672(28/60) 5499.11 12317.5 2.239914033
27852(28/60) 5349.78 11918.9 2.227919527
27852(20/52) 5335.92 11750.7 2.202186795
24576(12/44) 4991.37 10987.2 2.201247446

and this is pretty cool. Here are the (0/0) cases:

8192(0/0) 2627.82 3038.43 1.156255766
12288(0/0) 3116.62 3675.98 1.179475031
13926(0/0) 3330.04 4077.08 1.224334839
14336(0/0) 3377.95 4067.24 1.204055286
15018(0/0) 3465.3 4215.3 1.216430725
16384(0/0) 3623.33 4442.38 1.226050715
24576(0/0) 4629.53 6021.81 1.300737559
27852(0/0) 5026.69 6619.26 1.316823133
28672(0/0) 5157.73 6831.39 1.324495749
30037(0/0) 5322.01 6978.36 1.3112261

It is not 2x anymore but still.
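
As a sanity check on the tables above: the last column appears to be the
third column divided by the second. The sketch below recomputes it for
three rows copied from the tables. What the two middle columns measure is
not stated in this mail, so the code only checks the ratio; the function
names are made up for illustration. The recomputed values agree only to
a few decimal places, which would fit the printed columns being rounded.

```python
def parse_row(line):
    """Split a row such as '30037(12/44) 5566.4 12797.2 2.299011642'."""
    label, a, b, reported = line.split()
    return label, float(a), float(b), float(reported)


def recomputed_speedup(line):
    """Ratio of the third column to the second column of one row."""
    _, a, b, _ = parse_row(line)
    return b / a


# Rows copied verbatim from the tables above.
rows = [
    "30037(12/44) 5566.4 12797.2 2.299011642",
    "8192(0/0) 2627.82 3038.43 1.156255766",
    "28672(0/0) 5157.73 6831.39 1.324495749",
]

for line in rows:
    label, a, b, reported = parse_row(line)
    print(f"{label:>14} reported {reported:.4f} recomputed {b / a:.4f}")
```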

Anyway, looking at the buffer sizes, they're rather ridiculous and even
if we get them in some workload, they won't repeat n times per second to
be relevant. So we'll see...
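
To put the benchmark sizes next to what the kernel actually copies, the
attached kernel_build.sizes histogram can be summarized with something
like the sketch below. The helper names are made up, and the excerpt
holds only four rows copied from the attachment, so the share it prints
is not the share over the full histogram; feed it the whole file for a
real number.

```python
def parse_hist(text):
    """Parse 'Bytes Count' rows into a {size: count} dict, skipping header lines."""
    hist = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            hist[int(parts[0])] = int(parts[1])
    return hist


def share_at_most(hist, limit):
    """Fraction of all recorded copies whose size is <= limit bytes."""
    total = sum(hist.values())
    small = sum(count for size, count in hist.items() if size <= limit)
    return small / total


# Four rows copied from the attachment; NOT the full histogram.
excerpt = """\
Bytes Count
===== =====
16 143835
20 424321
144 100092
2048 21708
"""

hist = parse_hist(excerpt)
print("largest size seen:", max(hist))
print("share of copies <= 64 bytes:", share_at_most(hist, 64))
```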

Thanks.

--
Regards/Gruss,
Boris.

kernel_build.sizes:
Bytes Count
===== =====
0 5447
1 3850
2 16255
3 11113
4 68870
5 4256
6 30433
7 19188
8 50490
9 5999
10 78275
11 5628
12 6870
13 7371
14 4742
15 4911
16 143835
17 14096
18 1573
19 13603
20 424321
21 741
22 584
23 450
24 472
25 685
26 367
27 365
28 333
29 301
30 300
31 269
32 489
33 272
34 266
35 220
36 239
37 209
38 249
39 235
40 207
41 181
42 150
43 98
44 194
45 66
46 62
47 52
48 67226
49 138
50 171
51 26
52 20
53 12
54 15
55 4
56 13
57 8
58 6
59 6
60 115
61 10
62 5
63 12
64 67353
65 6
66 2363
67 9
68 11
69 6
70 5
71 6
72 10
73 4
74 9
75 8
76 4
77 6
78 3
79 4
80 3
81 4
82 4
83 4
84 4
85 8
86 6
87 2
88 3
89 2
90 2
91 1
92 9
93 1
94 2
96 2
97 2
98 3
100 2
102 1
104 1
105 1
106 1
107 2
109 1
110 1
111 1
112 1
113 2
115 2
117 1
118 1
119 1
120 14
127 1
128 1
130 1
131 2
134 2
137 1
144 100092
149 1
151 1
153 1
158 1
185 1
217 4
224 3
225 3
227 3
244 1
254 5
255 13
256 21708
512 21746
848 12907
1920 36536
2048 21708