> I think it shows that the memcpy size test is significant.
> Perhaps the FPU is best used only when explicitly requested
> for large operations. That would mean page clearing I guess.
if you take a look at the patch, you will see the following code:
+__memcpy_g (void *_to, const void *_from, __kernel_size_t _bytes)
+{
+ if (bytes >= 1024) {
> big_aligned_memcpy() and big_aligned_clear() perhaps?
> For 512 bytes and up, optimized for each arch.
the break-even point is, i think, around a few hundred bytes, but for
1024 bytes it's clearly faster even in the worst case.
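as a rough sketch of what such a size test amounts to (this is not code
from the patch): the only value taken from the patch is the 1024-byte
cutoff. big_copy(), choose_copy_path() and copy_by_size() are hypothetical
names, and big_copy() is just a memcpy() stand-in for the FPU path.

```c
#include <stddef.h>
#include <string.h>

#define BIG_COPY_THRESHOLD 1024  /* bytes; the cutoff quoted from the patch */

enum copy_path { COPY_PATH_PLAIN, COPY_PATH_BIG };

/* Hypothetical stand-in for the FPU-based copy; here it is plain memcpy(). */
static void *big_copy(void *to, const void *from, size_t bytes)
{
    return memcpy(to, from, bytes);
}

/* Decide which path a copy of the given size would take. */
static enum copy_path choose_copy_path(size_t bytes)
{
    return bytes >= BIG_COPY_THRESHOLD ? COPY_PATH_BIG : COPY_PATH_PLAIN;
}

/* Small copies never pay the FPU save/restore cost; large ones may. */
static void *copy_by_size(void *to, const void *from, size_t bytes)
{
    if (choose_copy_path(bytes) == COPY_PATH_BIG)
        return big_copy(to, from, bytes);
    return memcpy(to, from, bytes);
}
```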
> There may be a conflict with the user-space version.
> With both the kernel and apps abusing the FPU for memcpy,
> the FPU must be restored too often.
an 'fsave/frstor' pair takes ~200 cycles. [btw, instead of fsave, why
doesn't the patch save the FPU state manually? that should be _much_ faster,
methinks].
Copying 1024 bytes takes ~2000 cycles with a hot cache, ~4000 cycles with a
cold cache [typical midrange Pentium numbers]. So the FPU method only has to
be 10% faster (~200 of ~2000 cycles) to compensate for the save/restore
cost. And according to the README it's 35% faster.
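the arithmetic behind those numbers, as a small sketch. It assumes that
"N% faster" means the copy itself takes N% fewer cycles, which the README
claim may or may not mean. break_even_pct() and net_gain_cycles() are
made-up helper names.

```c
#include <assert.h>

/* Percentage by which the FPU copy must beat the plain copy so that the
 * saved cycles pay for the FPU save/restore.
 * Assumption: "N% faster" means the copy takes N% fewer cycles. */
static double break_even_pct(double save_restore_cycles, double copy_cycles)
{
    assert(copy_cycles > 0.0);
    return 100.0 * save_restore_cycles / copy_cycles;
}

/* Net cycles gained per copy for a claimed speedup; negative means a loss. */
static double net_gain_cycles(double save_restore_cycles, double copy_cycles,
                              double speedup_pct)
{
    return copy_cycles * speedup_pct / 100.0 - save_restore_cycles;
}
```

with ~200 cycles of save/restore, break_even_pct(200, 2000) gives 10 for
the hot-cache case and break_even_pct(200, 4000) gives 5 for the cold-cache
case. Under the same assumption, net_gain_cycles(200, 2000, 35) comes out
at 500 cycles per 1024-byte copy.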
what i find a bit interesting is the copying pattern, it's a strange 'comb
pattern', which might fool smarter (PPro) speculative reads ... but i have
not measured this yet, it's just a question. Accessing different
cachelines in successive instructions does have an advantage, but the way
the patch does it seems to be pretty aggressive.
-- mingo