> An unrolled multiply accumulate _can_ be done in 2 clocks per argument on
> a Pentium, however (hint: the fxch instruction can be made to take 0(!!)
> clocks if ordered properly). I put together a signal processing app that
> did dot products at 45 mflops on a P90 last year. But this was only if its
> working set fit within the L1 cache.
Hm? Let's see: Add throughput is 1 per cycle, Mul throughput is 1 per
cycle, but when do you fetch the arguments from L1 cache? Or are they
already in registers when you start your algorithm? Care to post your
actual code?
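
For reference, here is roughly the kind of loop I have in mind, as a
minimal C sketch. It is not the code from the quoted post: the name
dot_unrolled and the unroll factor of 4 are my own assumptions, and a
hand-scheduled Pentium version would be written in assembly with fxch
between the FPU stack operations. The point is only that every product
needs both operands fetched from memory, and that independent partial
sums give the adds and multiplies a chance to overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration; not the original poster's code. */
double dot_unrolled(const double *a, const double *b, size_t n)
{
    /* Four independent accumulators, so one add does not have to
       wait for the result of the previous add. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    assert(n == 0 || (a != NULL && b != NULL));

    /* Unrolled body: each term reads a[i] and b[i] from memory. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

Calling dot_unrolled(a, b, n) on two arrays of n doubles returns their
dot product. Whether a compiler turns this into anything close to 2
clocks per argument is a separate question.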
Tom