Re: Interesting pentium-memcpy results

Ingo Molnar
Tue, 29 Jul 1997 11:43:48 +0200 (MET DST)

On Tue, 29 Jul 1997, Albert D. Cahalan wrote:

> I think it shows that the memcpy size test is significant.
> Perhaps the FPU is best used only when explicitly requested
> for large operations. That would mean page clearing I guess.

if you take a look at the patch, you will see the following code:

+__memcpy_g (void *_to, const void *_from, __kernel_size_t _bytes)
+ if (bytes >= 1024) {
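
To make the size test concrete, here is a minimal user-space sketch of dispatching on the copy size. Only the 1024-byte cutoff comes from the excerpt above; the names BIG_COPY_THRESHOLD, pick_copy_path(), big_copy(), small_copy() and dispatch_memcpy() are hypothetical, and both copy paths are plain memcpy() stand-ins rather than the patch's FPU code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name; the value mirrors the `bytes >= 1024` test in the excerpt. */
#define BIG_COPY_THRESHOLD 1024

enum copy_path { COPY_PATH_SMALL, COPY_PATH_BIG };

/* Stand-in for the FPU-based copy (assumption: plain memcpy here). */
void *big_copy(void *to, const void *from, size_t bytes)
{
    return memcpy(to, from, bytes);
}

/* Stand-in for the ordinary integer copy. */
void *small_copy(void *to, const void *from, size_t bytes)
{
    return memcpy(to, from, bytes);
}

/* Decide which path a copy of `bytes` bytes should take. */
enum copy_path pick_copy_path(size_t bytes)
{
    return bytes >= BIG_COPY_THRESHOLD ? COPY_PATH_BIG : COPY_PATH_SMALL;
}

/* Route the copy so small copies never pay the FPU save/restore cost. */
void *dispatch_memcpy(void *to, const void *from, size_t bytes)
{
    if (pick_copy_path(bytes) == COPY_PATH_BIG)
        return big_copy(to, from, bytes);
    return small_copy(to, from, bytes);
}
```

With this shape, a 1023-byte copy stays on the ordinary path and a 1024-byte copy takes the big one, so the cutoff can be tuned in a single place.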

> big_aligned_memcpy() and big_aligned_clear() perhaps?
> For 512 bytes and up, optimized for each arch.

the break-even point, i think, is around a few hundred bytes, but at
1024 bytes it's clearly faster even in the worst case.

> There may be a conflict with the user-space version.
> With both the kernel and apps abusing the FPU for memcpy,
> the FPU must be restored too often.

an 'fsave/frstor' pair takes ~200 cycles. [btw, instead of using fsave,
why doesn't the patch save the FPU state manually? that should be _much_
faster, methinks].

Copying 1024 bytes takes ~2000 cycles with a hot cache, ~4000 cycles with
a cold cache. [typical midrange pentium numbers]. So the FPU method has to
be only 10% faster (~200 of ~2000 cycles) to compensate for the
save/restore cost, even in the hot-cache case. And according to the README
it's 35% faster.
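
The arithmetic behind that estimate can be sketched as below. The cycle counts are the rough figures quoted above; the function names are hypothetical, and "35% faster" is assumed here to mean "takes 35% fewer cycles", which the README may or may not intend.

```c
#include <assert.h>

/* Percentage of the plain copy's cycles that the faster copy must save
 * just to pay for the FPU save/restore overhead. */
long break_even_percent(long overhead_cycles, long copy_cycles)
{
    return 100 * overhead_cycles / copy_cycles;
}

/* Net cycles won per copy if the faster routine saves `saved_percent`
 * of the plain copy's cycles; a negative result means a net loss. */
long net_cycles_saved(long overhead_cycles, long copy_cycles, long saved_percent)
{
    return copy_cycles * saved_percent / 100 - overhead_cycles;
}
```

Under these assumptions, break_even_percent(200, 2000) is 10 and break_even_percent(200, 4000) is 5, while net_cycles_saved(200, 2000, 35) comes to 500 cycles per 1024-byte copy with a hot cache.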

what i find a bit interesting is the copying pattern, it's a strange 'comb
pattern', which might fool smarter (PPro) speculative reads ... but i have
not measured this yet, it's just a question. Accessing different
cachelines in successive instructions does have an advantage, but the way
the patch does it seems to be pretty aggressive.
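
For readers who have not opened the patch, the following is one possible shape of a 'comb' access order, written as portable C. It is an illustrative guess and not the patch's actual code: comb_copy(), COMB_LINES and the 8-byte slot size are assumptions; only the 32-byte L1 cache line size is a Pentium fact.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 32  /* Pentium L1 cache line size in bytes */
#define COMB_LINES 4   /* assumption: cache lines touched per round */
#define SLOT       8   /* assumption: bytes moved per access */

/* Copy `bytes` bytes, visiting the same slot of several cache lines in
 * turn before moving on to the next slot (the "teeth" of the comb). */
void *comb_copy(void *to, const void *from, size_t bytes)
{
    unsigned char *d = to;
    const unsigned char *s = from;
    const size_t block = (size_t)CACHE_LINE * COMB_LINES;

    assert(to != NULL && from != NULL);

    while (bytes >= block) {
        for (size_t off = 0; off < CACHE_LINE; off += SLOT) {
            /* Successive accesses land in different cache lines. */
            for (size_t line = 0; line < COMB_LINES; line++) {
                size_t pos = line * CACHE_LINE + off;
                memcpy(d + pos, s + pos, SLOT);
            }
        }
        d += block;
        s += block;
        bytes -= block;
    }
    memcpy(d, s, bytes);  /* tail shorter than one block: linear copy */
    return to;
}
```

The result is byte-for-byte the same as a linear copy; only the order of memory accesses differs, and that order is exactly what a speculative-read engine might or might not handle well.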

-- mingo