Re: Pentium II optimization (clc vs testl)

Kris Karas (mingo@chiara.csoma.elte.hu)
Fri, 10 Sep 1999 18:56:07 +0200 (CEST)


On Fri, 10 Sep 1999, Manfred Spraul wrote:

> but I think you read the wrong appendix:
> APPENDIX A and B are for the Pentium CPU,
> APPENDIX C and D are for the Pentium Pro.

you are right - and i'll quote an answer i gave previously (and which
didnt go to linux-kernel):

------------>
> [...] On PPro+ I'd expect clc to be imperceptibly better
> simply because it is smaller (1 byte vs. 2)

oops, ugly thinko on my side -- i did use the wrong document. The P6 core
does have two pipelines in the sense that 2 integer uops can be passed in
per cycle, and 2 uops can be retired per cycle, but due to instruction
reordering the classic meaning (and restrictions) do not apply. And AFAIR
the P6 core can retire only two uops/cycle, so the 3 uops/cycle decoding
speed cannot be saturated in a stationary way. (wasnt it 2 integer decodes
and 1 FPU/MMX decode per cycle?) Anyway, i previously measured the
overhead of clc vs. testl %X, %X before posting, and the testl version
performs better here - maybe you can explain why. I've attached the code,
the OVERHEAD #define is hand-tailored (with empty measured section the
result should be 0 cycles) to my box - this can be different on other
boxes. The BARRIER thing might looks curious, but rdtsc has to be shielded
from the measured section, otherwise rdtsc's uops might mix up and
interact with the measured section - causing false results.

anyway, the code produces 23 cycles with the clc variant, and 17 cycles
with the testl variant (change the #if 1 define to #if 0). Both
benchmarked sections execute 12 clc's/testl's in a row. I dont think there
should be any alignment issues here, but i could be wrong. Run it as root
so that cli can execute.

-- mingo

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/unistd.h>

static unsigned long long t1,t2;

#define ASM(x) __asm__ __volatile__ (x :::"memory")
#define CYCLES(x) __asm__ __volatile__ ("rdtsc" :"=a" (*(((int*)&x)+0)), "=d" (*(((int*)&x)+1))::"memory")

#define BARRIER() asm volatile ("cli; cli;")

void main()
{
volatile long long i, j, min=1000000;

iopl(3);
for (i=0; i<100000; i++) {
BARRIER(); CYCLES(t1); BARRIER();
#if 1
ASM("clc; clc;");
ASM("clc; clc;");
ASM("clc; clc;");
ASM("clc; clc;");
ASM("clc; clc;");
ASM("clc; clc;");
ASM("clc; clc;");
ASM("clc; clc;");
ASM("clc; clc;");
ASM("clc; clc;");
ASM("clc; clc;");
ASM("clc; clc;");
#else
ASM("testl %%esi, %%esi;testl %%esi, %%esi;");
ASM("testl %%esi, %%esi;testl %%esi, %%esi;");
ASM("testl %%esi, %%esi;testl %%esi, %%esi;");
ASM("testl %%esi, %%esi;testl %%esi, %%esi;");
ASM("testl %%esi, %%esi;testl %%esi, %%esi;");
ASM("testl %%esi, %%esi;testl %%esi, %%esi;");
ASM("testl %%esi, %%esi;testl %%esi, %%esi;");
ASM("testl %%esi, %%esi;testl %%esi, %%esi;");
ASM("testl %%esi, %%esi;testl %%esi, %%esi;");
ASM("testl %%esi, %%esi;testl %%esi, %%esi;");
ASM("testl %%esi, %%esi;testl %%esi, %%esi;");
ASM("testl %%esi, %%esi;testl %%esi, %%esi;");
#endif

BARRIER(); CYCLES(t2); BARRIER();
#define OVERHEAD 64
if (t2-t1 < min) {
min = t2-t1;
printf("%Lx %Lx -> %Ld cycles\n", t1,t2,min-OVERHEAD);
}
}

printf("best latency: %Ld cycles\n", min-OVERHEAD);
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/