Speed of memcpy, csum_partial and csum_partial_copy

Jamie Lokier (jamie@rebellion.co.uk)
Fri, 7 Jun 96 18:19 BST


I have a couple of suggestions regarding the speed of these functions on
486s and Pentiums. I don't intend to look into this, but someone else
may be interested.

Someone from Intel told me that the fastest way to copy memory on a
Pentium is to preload about a page's worth of data into the cache, by
touching every 32nd byte. Then proceed with the fastest copy loop you
can, to saturate the write buffers. Alternate as necessary for large
copies. This is supposed to be faster because you avoid most of the
DRAM page misses when turning around from read to write, and vice versa.
The example timings for page misses that he quoted would seem to bear
this out as worthwhile -- more so than using 64-bit writes on a fast
Pentium. (He said the chipset would merge 32-bit writes once they got
out of the CPU anyway).

I haven't tried this of course. But then I don't have a Pentium.
(Although the idea looks sound on a 486 too).

My point is that you may be able to make memcpy, csum_partial and
csum_partial_copy faster by preloading the data into the cache -- just
touch every 32nd byte before the csum+copy loop.

(On the 486, touch every 16th byte and don't load so much because the
cache isn't as large).
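A rough sketch of the preload-then-copy idea in C, to make the shape
concrete. The function name, the 4K chunk size and the structure are my
own guesses at what the Intel suggestion amounts to, not anything from
the real kernel code (which would be hand-written assembly anyway):

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 32   /* Pentium cache line; use 16 on a 486 */
#define CHUNK      4096 /* preload about a page at a time */

static void preload_copy(void *dst, const void *src, size_t n)
{
    const char *s = src;
    char *d = dst;

    while (n) {
        size_t chunk = n < CHUNK ? n : CHUNK;
        volatile char sink;
        size_t i;

        /* Phase 1: touch one byte per cache line, so all the DRAM
         * reads happen in a burst instead of alternating with the
         * writes (and paying the read/write turnaround each time). */
        for (i = 0; i < chunk; i += CACHE_LINE)
            sink = s[i];
        (void)sink;

        /* Phase 2: the copy loop now runs mostly out of the cache,
         * and is limited by how fast the write buffers drain. */
        memcpy(d, s, chunk);

        s += chunk;
        d += chunk;
        n -= chunk;
    }
}
```

The same two-phase structure would wrap the csum+copy loop instead of
memcpy for csum_partial_copy.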

My other point is that csum_partial_copy looks like it could run 33%
faster, when everything is in the CPU cache, by rearranging the
instructions in the loop to pair fully -- I think that adcl can pair but
only in the U pipe. (I could be wrong about this. Even if adcl can
pair in either pipe, there are still write-then-read dependencies that
prevent pairing in that code).
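For reference, a plain C sketch of what csum_partial_copy computes: a
16-bit ones-complement partial sum folded in while the data is copied.
This is only to show the add-with-carry dependency chain the pairing
comment is about -- each addition needs the previous sum, which is what
the x86 adcl sequence has to schedule around. The function name is mine;
the real routine is hand-written assembly:

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t csum_copy_sketch(uint8_t *dst, const uint8_t *src,
                                 size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        /* little-endian 16-bit word, as the x86 code sees it */
        uint16_t w = (uint16_t)(src[i] | (src[i + 1] << 8));
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        sum += w;   /* serial dependency: needs the previous sum */
    }
    if (i < len) {  /* trailing odd byte */
        dst[i] = src[i];
        sum += src[i];
    }
    /* Fold the carries back into 16 bits, as adcl does implicitly. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return sum;
}
```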

-- Jamie