Re: Efficient x86 and x86_64 NOP microbenchmarks

From: Linus Torvalds
Date: Wed Aug 13 2008 - 14:29:23 EST




On Wed, 13 Aug 2008, Mathieu Desnoyers wrote:
>
> I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and
> Intel Pentium 4 boxes to compare a baseline

Note that the biggest problems of a jump-based nop are likely to happen
when there are I$ misses and/or when there are other jumps involved. Ie
some microarchitectures tend to have issues with jumps to jumps, or when
there are multiple control changes in the same (possibly partial)
cacheline because the instruction stream prediction may be predecoded in
the L1 I$, and multiple branches in the same cacheline - or in the same
execution cycle - can pollute that kind of thing.

So microbenchmarking this way will probably make some things look
unrealistically good.

On the P4, the trace cache makes things even more interesting, since it's
another level of I$ entirely, with very different behavior for the hit
case vs the miss case.

And I$ misses for the kernel are actually fairly high. Not in
microbenchmarks that tend to have very repetitive behavior and a small I$
footprint, but in a lot of real-life loads the *bulk* of all action is in
user space, and then the kernel side is often invoked with few loops (the
kernel has very few loops indeed) and a cold I$.

So your numbers are interesting, but it would be really good to also get
some info from Intel/AMD who may know about microarchitectural issues for
the cases that don't show up in the hot-I$-cache environment.
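
To make the hot-vs-cold distinction concrete, here is a minimal user-space
sketch (not the benchmark from the quoted mail, and not kernel code). It
times the very first call to a small function separately from the average
over a tight loop. The names probe_site() and time_first_vs_hot() are
hypothetical, and the first-call figure is only a rough upper bound: timer
overhead dominates a single call, and a first call also pays for things
other than I$ misses. The point is just that a loop average hides whatever
the first touch costs.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the code path being measured.
 * noinline (GCC/Clang attribute) keeps it a real call target. */
__attribute__((noinline)) static unsigned probe_site(unsigned x)
{
	return x * 2654435761u + 1u;
}

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

struct timing {
	uint64_t first_ns;    /* first call: caches likely not warmed yet */
	uint64_t hot_avg_ns;  /* average per call over the hot loop */
	unsigned checksum;    /* keeps the calls from being optimized away */
};

static struct timing time_first_vs_hot(unsigned iters)
{
	struct timing t;
	unsigned acc = 0;
	uint64_t t0, t1;
	unsigned i;

	/* One isolated call, measured before anything has warmed it up. */
	t0 = now_ns();
	acc = probe_site(acc);
	t1 = now_ns();
	t.first_ns = t1 - t0;

	/* The usual microbenchmark shape: same code, over and over. */
	t0 = now_ns();
	for (i = 0; i < iters; i++)
		acc = probe_site(acc);
	t1 = now_ns();
	t.hot_avg_ns = iters ? (t1 - t0) / iters : 0;

	t.checksum = acc;
	return t;
}
```

Calling time_first_vs_hot() with a large iteration count and printing
first_ns next to hot_avg_ns gives the two numbers to compare. Only the
second one resembles what a tight-loop microbenchmark reports.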

Linus