copy_page(): non-temporal stores look useful?

From: Denis Vlasenko
Date: Fri Aug 20 2004 - 12:33:23 EST


Hi,

I have been playing with MMX, SSE and non-temporal page-zeroing
and page-copying routines for quite a while now.

NB: the CPU is a Duron Applebred, 1750MHz, 64k L1 + 64k L2.

In short, this is what I found:

* MMX and SSE routines are very close in speed.
Both are 5-10% faster than the "rep stosd"/"rep movsd" ones.
* Aligning the main loop gives a 0-2% speedup.
* Prefetching data in the copy loop gives a 5-10% speedup.
* Using non-temporal stores gives a 50-300% speedup,
but the copied data ends up in main memory, not in the cache
(see the sketch of an NT copy loop right after this list).
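
For reference, the core of such an NT copy loop looks roughly like
this. This is a minimal user-space sketch in the spirit of the
kernel's arch/i386/lib/mmx.c, NOT the exact code from the attached
tarball; the prefetch distance and unroll factor are illustrative:

    /*
     * Copy one 4k page with MMX loads and non-temporal (movntq) stores.
     * Assumes a CPU with MMX + SSE (prefetchnta/movntq/sfence). In the
     * kernel this must run between kernel_fpu_begin()/kernel_fpu_end().
     */
    static void copy_page_nt(void *to, const void *from)
    {
            const char *s = from;
            char *d = to;
            int i;

            for (i = 0; i < 4096; i += 64) {
                    __asm__ __volatile__(
                            /* pull a source line in, a few iterations ahead */
                            "prefetchnta 320(%0)\n\t"
                            "movq   (%0), %%mm0\n\t"
                            "movq  8(%0), %%mm1\n\t"
                            "movq 16(%0), %%mm2\n\t"
                            "movq 24(%0), %%mm3\n\t"
                            /* movntq bypasses the cache via write combining */
                            "movntq %%mm0,   (%1)\n\t"
                            "movntq %%mm1,  8(%1)\n\t"
                            "movntq %%mm2, 16(%1)\n\t"
                            "movntq %%mm3, 24(%1)\n\t"
                            "movq 32(%0), %%mm0\n\t"
                            "movq 40(%0), %%mm1\n\t"
                            "movq 48(%0), %%mm2\n\t"
                            "movq 56(%0), %%mm3\n\t"
                            "movntq %%mm0, 32(%1)\n\t"
                            "movntq %%mm1, 40(%1)\n\t"
                            "movntq %%mm2, 48(%1)\n\t"
                            "movntq %%mm3, 56(%1)\n\t"
                            : : "r" (s + i), "r" (d + i) : "memory");
            }
            /* drain the write-combining buffers, clear MMX state */
            __asm__ __volatile__("sfence\n\temms" : : : "memory");
    }

The non-NT variant is identical except that movntq is replaced by
plain movq, so the destination lines land in the cache instead.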

Are non-temporal stores a win or a loss? We need to measure
not just zero_page()/copy_page() speed, but also the speed of
subsequent accesses to the page data (a read-back measurement
of the kind sketched further below). Here's what I found:

* In small-footprint testing which fits entirely in the L1 dcache,
NT zero_page() is actually a loss if we read the zeroed-out
page afterwards. Since the CPU typically uses write-allocate,
we end up reading the page from memory even if the code only
writes into it.

* HOWEVER, I was unable to demonstrate that the NT copy_page() routine
is a loss when we read the copied data back. Whatever I tried (small,
medium, large dataset), NT copy_page() is *faster* than the non-NT one.

* This contradicts Arjan's results - he says that NT is a loss.
(I still would like to try to reproduce his testing. Arjan?)

How can that be? Below you will find all the gory details, thoughts,
and the means to reproduce and/or test on a different CPU.
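
The kind of read-back measurement I mean can be sketched like this
(a hypothetical user-space helper, not the test program from the
tarball; rdtsc cycle counts are only comparable on the same box):

    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void)
    {
            uint32_t lo, hi;
            __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
            return ((uint64_t)hi << 32) | lo;
    }

    /* time one read pass over a page that was just zeroed or copied */
    static uint64_t time_read_pass(volatile const char *page)
    {
            uint64_t t0, t1;
            int i, sum = 0;

            t0 = rdtsc();
            for (i = 0; i < 4096; i += 64)  /* one load per cache line */
                    sum += page[i];
            t1 = rdtsc();

            if (sum == -1)                  /* keep the loop from being */
                    printf("impossible\n"); /* optimized away           */
            return t1 - t0;
    }

If the page was written with NT stores, this pass has to fetch
everything from main memory; after a cached copy it mostly hits L1/L2.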

Typically copy_page() is done when the kernel needs to un-COW
a page because one of the forked processes wants
to write to it, so for copy_page() benchmarking I used
code of this shape:
alloc(buf);
for(;;) { fork(); use(buf); }
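
Spelled out, a minimal runnable version of that loop might look like
this (a sketch, NOT the exact test program from the attached tarball;
buffer size and loop count are just examples):

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdlib.h>

    int main(void)
    {
            size_t size = 64 * 1024;        /* vary this: 4k ... 320k */
            int i, loops = 10000;
            size_t j;
            char *buf;

            buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED)
                    exit(1);
            memset(buf, 1, size);           /* fault all pages in up front */

            for (i = 0; i < loops; i++) {
                    pid_t pid = fork();
                    if (pid == 0) {
                            /* child: the first write to each shared page
                             * un-COWs it, i.e. costs one copy_page() */
                            for (j = 0; j < size; j += 4096)
                                    buf[j]++;
                            _exit(0);
                    }
                    waitpid(pid, NULL, 0);
            }
            return 0;
    }

Each fork() re-shares the pages, so every iteration triggers a fresh
batch of COW faults in the child.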

It turned out that the kernel zeroes out 3 pages and copies 8 more
pages on each fork (11 pages, i.e. 44k with 4k pages), and, obviously,
the kernel does NOT use the entire 44k of resultant data. It accesses
some 100 bytes here and there, and that's all. Thus, testing shows
that NT ops are a win for fork().

If I use a smallish mmap'ed buffer, so that sizeof(buffer)+44k still
fits into L1, the speed gains from fork() outweigh the losses due to
slower access to the NT-copied pages.

If I use a larger buffer, so that sizeof(buffer)+44k does not fit in
the cache, NT copy wins hands down, as expected.

Commented test results are below. SSE results are omitted (very close to MMX).
Test programs/scripts and kernel code are in the attached tarball.
Each test was run 5 times; results are sorted.
NNN/MMM == NNN zero_page() and MMM copy_page() calls were made.
slow == rep stosd / rep movsd routines
mmx_APn == aligned loop, using prefetch, NOT using non-temporal stores
mmx_APN == aligned loop, using prefetch, using non-temporal stores
mmx_APn/APN == use NT only for copy (zero_page is non-NT)

NT zero_page() is a clear loss for a small buffer (which is the most
common and most important case):
320k zeroing, 5x20000 loops:
slow: 0m9.584 0m9.612 0m9.635 0m9.687 0m9.887 8000297/497
mmx_APn: 0m8.355 0m8.506 0m8.531 0m8.548 0m8.744 8000266/444
mmx_APN: 0m5.413 0m5.426 0m5.448 0m5.550 0m5.653 8000262/448
32k zeroing, 5x200000 loops:
slow: 0m3.821 0m3.866 0m3.938 0m4.006 0m4.075 8000274/471
mmx_APn: 0m3.547 0m3.622 0m3.798 0m3.876 0m4.102 8000241/414
mmx_APN: 0m5.985 0m6.021 0m6.186 0m6.268 0m6.775 8000267/441
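
(For completeness, the mmx_APN zeroing loop is the same idea as the
copy loop above, with a zeroed register as the source. Again only a
sketch, not the exact code from the tarball:)

    static void zero_page_nt(void *page)
    {
            char *d = page;
            int i;

            /* mm0 = 0; the stores need SSE's movntq/sfence */
            __asm__ __volatile__("pxor %%mm0, %%mm0" : :);
            for (i = 0; i < 4096; i += 32) {
                    __asm__ __volatile__(
                            "movntq %%mm0,   (%0)\n\t"
                            "movntq %%mm0,  8(%0)\n\t"
                            "movntq %%mm0, 16(%0)\n\t"
                            "movntq %%mm0, 24(%0)\n\t"
                            : : "r" (d + i) : "memory");
            }
            __asm__ __volatile__("sfence\n\temms" : : : "memory");
    }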

Ok, now, copy_page() testing:
NB: each copy test does one fork per loop iteration.
With each fork, the kernel zeroes out 3 pages and copies 8 pages.
This amounts to 12k+32k = 44k bytes.
256k copying, 5x5000 loops:
slow: 0m8.036 0m8.063 0m8.192 0m8.233 0m8.252 75600/1800468
mmx_APn: 0m7.461 0m7.496 0m7.543 0m7.687 0m7.725 75586/1800446
mmx_APN: 0m6.351 0m6.366 0m6.378 0m6.382 0m6.525 75586/1800436
mmx_APn/APN: 0m6.412 0m6.448 0m6.501 0m6.663 0m6.669 75584/1800439
128k copying, 5x5000 loops:
slow: 0m4.732 0m4.747 0m4.751 0m4.773 0m4.776 75466/1000469
mmx_APn: 0m4.258 0m4.331 0m4.343 0m4.386 0m4.422 75406/1000422
mmx_APN: 0m3.658 0m3.672 0m3.784 0m3.798 0m3.818 75452/1000436
mmx_APn/APN: 0m3.713 0m3.713 0m3.840 0m3.850 0m3.857 75435/1000413
64k copying, 5x10000 loops:
slow: 0m5.869 0m5.885 0m5.894 0m5.904 0m5.906 150356/1200472
mmx_APn: 0m5.369 0m5.391 0m5.404 0m5.424 0m5.426 150345/1200444
mmx_APN: 0m4.804 0m4.826 0m4.843 0m4.843 0m4.934 150355/1200436
mmx_APn/APN: 0m4.878 0m4.883 0m4.926 0m4.937 0m4.962 150343/1200441
32k copying, 5x20000 loops:
slow: 0m8.088 0m8.125 0m8.241 0m8.245 0m8.326 300320/1600461
mmx_APn: 0m7.527 0m7.662 0m7.706 0m7.750 0m7.802 300303/1600438
mmx_APN: 0m6.630 0m6.661 0m6.681 0m6.696 0m6.735 300303/1600442
28k copying, 5x20000 loops:
slow: 0m7.678 0m7.804 0m7.817 0m7.888 0m7.920 300329/1500470
mmx_APn: 0m7.064 0m7.084 0m7.138 0m7.268 0m7.338 300298/1500437
mmx_APN: 0m5.956 0m5.960 0m6.048 0m6.107 0m6.200 300297/1500435
20k copying, 5x20000 loops:
slow: 0m6.610 0m6.665 0m6.694 0m6.750 0m6.774 300315/1300468
mmx_APn: 0m6.208 0m6.218 0m6.263 0m6.335 0m6.452 300352/1300448
mmx_APN: 0m4.887 0m4.984 0m5.021 0m5.052 0m5.057 300295/1300443
mmx_APn/APN: 0m5.115 0m5.160 0m5.167 0m5.172 0m5.183 300292/1300443

NB: above, a non-NT zero_page() does NOT improve things (compare mmx_APN and mmx_APn/APN)

4k copying, 5x40000 loops:
slow: 0m8.303 0m8.334 0m8.354 0m8.510 0m8.572 600313/1800473
mmx_APn: 0m8.233 0m8.350 0m8.406 0m8.407 0m8.642 600323/1800467
mmx_APN: 0m6.475 0m6.501 0m6.510 0m6.534 0m6.783 600302/1800436
mmx_APn/APN: 0m6.540 0m6.551 0m6.603 0m6.640 0m6.708 600271/1800442

These runs are dominated by fork() overhead. NT again wins! Probably
because fork() does not read back the entire 44k of zeroed/copied memory.

I'd like to try some non-forking workload. Is there ANY workload
which will be faster with the non-NT copy_page() than with the NT one?

Comments?
--
vda

Attachment: fp.tar.gz
Description: application/tgz