Re: Interesting pentium-memcpy results

Robert L Krawitz (rlk@tiac.net)
Tue, 29 Jul 1997 09:12:04 -0400 (EDT)


Date: Tue, 29 Jul 1997 11:43:48 +0200 (MET DST)
From: Ingo Molnar <mingo@pc7537.hil.siemens.at>

On Tue, 29 Jul 1997, Albert D. Cahalan wrote:

> I think it shows that the memcpy size test is significant.
> Perhaps the FPU is best used only when explicitly requested
> for large operations. That would mean page clearing I guess.

if you take a look at the patch, you would see the following code:

+__memcpy_g (void *_to, const void *_from, __kernel_size_t _bytes)
+{
+ if (bytes >= 1024) {

> big_aligned_memcpy() and big_aligned_clear() perhaps?
> For 512 bytes and up, optimized for each arch.

the break even point i think is around a few hundred bytes, but for
1024 bytes it's clearly faster even in the worst case.

On my system (P90, Neptune chipset, memory clocked X-3-3-3) breakeven
is somewhere around 256 bytes. With EDO or SDRAM, the breakeven point
is probably a little higher. I set it to 1024 because there's a
decent chance that smaller copies are already in cache, and the
FPU memcpy() loses badly if the destination is in cache.
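
To illustrate the size cutoff discussed above, here is a minimal sketch
in C. It is not the code from the patch: copy_dispatch(), big_copy(),
BIG_COPY_THRESHOLD and big_path_taken are hypothetical names, and
big_copy() is only a stand-in that calls memcpy() where the real FPU
copy loop would go.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff; 1024 bytes is the value used in this thread. */
#define BIG_COPY_THRESHOLD 1024

/* Counts how often the large-copy path was chosen (illustration only). */
static int big_path_taken = 0;

/* Stand-in for the FPU-based copy. A real version would save the FPU
   state, run the FPU load/store loop, and restore the state. */
static void *big_copy(void *to, const void *from, size_t bytes)
{
    big_path_taken++;
    return memcpy(to, from, bytes);
}

/* Small copies take the ordinary path, since they are likely to be in
   cache already; large copies take the big-copy path. */
static void *copy_dispatch(void *to, const void *from, size_t bytes)
{
    assert(to != NULL && from != NULL);
    if (bytes >= BIG_COPY_THRESHOLD)
        return big_copy(to, from, bytes);
    return memcpy(to, from, bytes);
}
```

The only design point shown is that the size test sits in front of the
expensive path, so short copies never pay for the FPU save/restore.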

> There may be a conflict with the user-space version.
> With both the kernel and apps abusing the FPU for memcpy,
> the FPU must be restored too often.

an 'fsave/frstor' takes ~200 cycles. [btw, instead of fsave, why
doesn't the patch save the FPU state manually? that should be _much_ faster,
methinks].

Unless you can easily be much more selective about what to save, I
suspect it's all dominated by the memory time.

Copying 1024 bytes takes ~2000 cycles with a hot cache, ~4000 cycles with a
cold cache. [typical midrange pentium numbers]. So the FPU method has to
be only 10% faster to compensate for the save/restore cost. And according to the README
it's 35% faster.

35% net, including save/restore.
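
The break-even arithmetic above can be written out as a small sketch.
breakeven_fraction() is a hypothetical helper, and the cycle counts are
just the rough figures quoted in this thread. If a plain copy takes C
cycles and the save/restore adds O cycles, the FPU loop has to shave off
a fraction O/C of the copy time merely to break even.

```c
#include <assert.h>

/* Fraction by which the copy loop itself must get faster so that the
   cycles saved pay for the FPU save/restore overhead.
   Break-even: copy_cycles * (1 - s) + overhead_cycles == copy_cycles,
   hence s == overhead_cycles / copy_cycles. */
static double breakeven_fraction(double overhead_cycles, double copy_cycles)
{
    assert(copy_cycles > 0.0);
    return overhead_cycles / copy_cycles;
}
```

With the numbers above, breakeven_fraction(200, 2000) gives 0.10 for the
hot-cache case and breakeven_fraction(200, 4000) gives 0.05 for the
cold-cache case.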

what i find a bit interesting is the copying pattern, it's a strange 'comb
pattern', which might fool smarter (PPro) speculative reads ... but i have
not measured this yet, it's just a question. Accessing different
cachelines in successive instructions does have an advantage, but the way
the patch does it seems to be pretty aggressive.

I determined this empirically by trying several different patterns.
On my system, this striding pattern does about 20% better than a
straight through pattern.
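
For readers without the patch at hand, here is a rough sketch in C of
what a strided ("comb") access order can look like. It is not the
pattern from the patch, which is not reproduced in this message:
comb_copy(), the block width of four lines, and the use of memcpy() for
each 8-byte move are all assumptions made for illustration. The 32-byte
line size matches the Pentium's L1 cache.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_BYTES      32  /* Pentium L1 cache line size */
#define WORD_BYTES      8   /* one 64-bit FPU-sized move */
#define LINES_PER_BLOCK 4   /* hypothetical comb width */
#define BLOCK_BYTES     (LINE_BYTES * LINES_PER_BLOCK)

/* Copy in blocks of several cache lines. Within a block, successive
   8-byte moves land in different cache lines: word 0 of every line,
   then word 1 of every line, and so on. A straight-through copy would
   instead finish each line before touching the next. */
static void *comb_copy(void *to, const void *from, size_t bytes)
{
    unsigned char *d = to;
    const unsigned char *s = from;
    size_t done = 0;

    assert(to != NULL && from != NULL);

    while (bytes - done >= BLOCK_BYTES) {
        for (size_t word = 0; word < LINE_BYTES; word += WORD_BYTES)
            for (size_t line = 0; line < BLOCK_BYTES; line += LINE_BYTES)
                memcpy(d + done + line + word,
                       s + done + line + word, WORD_BYTES);
        done += BLOCK_BYTES;
    }

    /* Tail shorter than one block: plain copy. */
    memcpy(d + done, s + done, bytes - done);
    return to;
}
```

Whether any particular ordering helps depends on the chipset and memory
timing, so it has to be measured on the machine in question.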

BTW, when I measured this on a PPro 200, the FPU memcpy() was still
about 10% faster than rep movsd.

-- 
Robert Krawitz <rlk@tiac.net>           http://www.tiac.net/users/rlk/

Tall Clubs International -- http://www.tall.org/ or 1-800-521-2512
Member of the League for Programming Freedom -- mail lpf@uunet.uu.net