Re: Off Topic (MMX Overdrive performance)

Gabriel Paubert (paubert@iram.es)
Thu, 24 Apr 1997 12:12:27 +0200 (METDST)


I replied privately to Michael L. Galbraith <mikeg@weiden.de>, but since
even Linus is posting about it:

On 23 Apr 1997, Linus Torvalds wrote:

> In article <Pine.LNX.3.95.970423072107.1678A-100000@mikeg.weiden.de>,
> Michael L. Galbraith <mikeg@weiden.de> wrote:
> >
> >Anyone know what the heck Intel did to the MMX-Overdrive(150) to account
> >for this?
>
> Impressive. They seem to have improved memory read performance

There are a few good reasons for the improved performance of the MMX
Pentium over the non-MMX Pentium, and they have nothing to do with the MMX
instructions, whatever Intel's marketing hype would have us believe. MMX
instructions are almost useless (unless you have a very large array to
process) since:
- no compiler will generate them (I doubt any ever will, and they can't
even be handled easily with GCC's inline assembly, because GCC does not
know that the MMX registers exist or how they interact with floating
point),
- they cannot be mixed with floating point (the MMX registers are aliased
onto the floating-point register stack),
- switching from floating point to MMX and back is very expensive.

So the first good reason for MMX Pentia to be faster is that the caches are
twice as large and are 4-way set associative instead of 2-way.
For a few examples of how easy it is to thrash a 2-way set associative
cache (I have some experience with this, as 4-way associativity is
required for good performance of FFT algorithms), see:

http://announce.com/agner/assem/pentopt.zip

This text also explains the subtle decoder differences between the two
versions, the problems with branch prediction in the original Pentium, and
other arcane details and bugs. But IMHO, the main reasons for improved
performance are, in order of decreasing importance:

1) cache associativity
2) cache size
3) improved branch prediction (including return prediction, see pentopt.zip)
4) more writeback buffers (this improves read performance by postponing writes)
5) somewhat more pairing opportunities (displacement+immediate instructions)
6) improved decoder (FIFO stage), which may also help pairing on first pass
...
+infinity) MMX instructions

The net result is that the MMX Pentium is almost as fast as the Pentium
Pro at the same clock rate while dissipating much less heat.

See also the benchmarks at http://sysdoc.pair.com, which include the
Pentium II.

Sorry to be off-topic, but this will hopefully stop this thread!

Regards,
Gabriel.

P.S.: BTW, does anybody know why some messages come out repeated by the
tens, especially the ones about mprotect and T. Tso's about TLI/streams?
Some server around here must be stuttering.