Pentium __delay weirdness... the final results (longish)

Gordon Oliver (gordo@telsur.cl)
Fri, 12 Sep 1997 11:37:20 -0400 (CST)


Hi again.

Here's the final summary. All numbers are in cycles as reported by rdtsc.
The test uses assembly that executes the entire timing loop twice, in an
effort to get the caching right.

---------------------------Normal loop----------------------------------

2:      decl %eax
        jns 2b

offset of label 2:      cycles:
0-9                     6*loop_count + overhead
10-4092                 5*loop_count + overhead
4093-4095               2*loop_count + overhead

offset mod 32 of label 2:       overhead:
5                               47-49
other                           15-17

---------------------------Nop loop-------------------------------------

2:      decl %eax
        nop
        jns 2b

offset of label 2:      cycles:
0-9                     3*loop_count + overhead
10-4095                 2*loop_count + overhead

overhead = 19-21

--------------------
Solution 1: add align 4 to __delay. An aligned loop can never start at
offsets 4093-4095, so this would leave a 20% variation in the possible delay
values (6 vs. 5 cycles per iteration), but at least there isn't a 3 times
variation (6 vs. 2). The nop should not be put in the loop, as the
percentage variation jumps to 50% (3 vs. 2) in that case.
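
The percentages in Solution 1 can be checked against the normal-loop table
with a few lines of C. This is only a sketch of the arithmetic: the function
names are made up for illustration, the offsets are taken to be offsets
within a 4096-byte page (as the 0-4095 ranges suggest), and the alignment is
assumed to apply to label 2 itself.

```c
/* Per-iteration cost of the normal loop, copied from the table above.
 * `offset` is the offset of label 2, 0..4095. */
static int normal_loop_cost(int offset)
{
    if (offset <= 9)
        return 6;
    if (offset <= 4092)
        return 5;
    return 2;                   /* offsets 4093-4095 */
}

/* Spread between the slowest and fastest per-iteration cost over all
 * offsets that are a multiple of `align`, as a percentage of the fastest.
 * Hypothetical helper, not kernel code. */
static int variation_percent(int align)
{
    int lo = 1000, hi = 0;

    for (int off = 0; off < 4096; off += align) {
        int c = normal_loop_cost(off);
        if (c < lo)
            lo = c;
        if (c > hi)
            hi = c;
    }
    return (hi - lo) * 100 / lo;
}
```

With no alignment (align = 1) the slowest case costs 6 cycles and the
fastest 2, i.e. the 3 times variation; with align = 4 the 2-cycle offsets
drop out and only the 6 vs. 5 difference remains.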

Solution 2: make __delay not be an inline function. That way it should not
cause problems as it will _always_ be at the same offset.

Solution 3: make __delay not be an inline function, and also have it use
rdtsc when available (this idea is from H. Peter Anvin). This means
adding some sort of check for rdtsc and using a function pointer
to choose the appropriate delay. As long as there are fewer bugs
in rdtsc than in jump prediction this will be a win (seems likely).
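
A rough user-space sketch of what Solution 3 could look like, in C. It is
not the kernel's __delay: the names delay_loop, delay_tsc, cpu_has_tsc,
delay_fn and delay_init are invented here, and it relies on the GCC/clang
headers <cpuid.h> and <x86intrin.h>. Note that the two variants count
different things (loop iterations vs. TSC ticks), so a real version would
need a calibrated count for each.

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>       /* __get_cpuid (GCC/clang) */
#include <x86intrin.h>   /* __rdtsc */
#define HAVE_X86 1
#else
#define HAVE_X86 0
#endif

/* Fallback: a plain counting loop, subject to the offset effects above. */
static void delay_loop(uint32_t count)
{
    for (volatile uint32_t i = count; i != 0; i--)
        ;
}

#if HAVE_X86
/* Spin until at least `count` TSC ticks have elapsed. */
static void delay_tsc(uint32_t count)
{
    uint64_t start = __rdtsc();

    while (__rdtsc() - start < count)
        ;
}

/* CPUID leaf 1, EDX bit 4 reports the time stamp counter. */
static int cpu_has_tsc(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 4) & 1;
}
#endif

/* The function pointer: defaults to the loop, switched once at startup. */
static void (*delay_fn)(uint32_t) = delay_loop;

static void delay_init(void)
{
#if HAVE_X86
    if (cpu_has_tsc())
        delay_fn = delay_tsc;
#endif
}
```

Callers would run delay_init() once and then always go through
delay_fn(count), so the choice costs one indirect call per delay.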

---------------------
Details. The entire timing loop.

        movl $1, %ecx           # for the exterior loop
        cli
1:      .byte 0x0f,0x31         # aka rdtsc
        movl %eax,%ebx          # save the low 32 bits
        movl %esi,%eax          # %esi has the loop count
2:      decl %eax
        jns 2b
        .byte 0x0f,0x31         # aka rdtsc
        subl $1, %ecx
        jns 1b
        sti
        nop

For the nop case, the nop is moved from after the sti to after the
decl %eax. In each case the size of the timing loop, plus the code to save
the address and difference, is exactly 47 bytes. Since 47 is odd (and so has
no common factor with 4096), repeating the block 4096 times gives every
possible offset mod 4096.
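
The coverage claim is easy to confirm with a small C check. This is just a
sketch: distinct_offsets is a made-up helper, and it assumes the copies are
laid out back to back (the starting address only shifts every offset by the
same constant).

```c
#include <string.h>

enum { PAGE = 4096 };

/* Count how many distinct offsets mod 4096 are hit by `copies`
 * consecutive blocks of `block_size` bytes. Illustrative helper only. */
static int distinct_offsets(int block_size, int copies)
{
    unsigned char seen[PAGE];
    int distinct = 0;

    memset(seen, 0, sizeof seen);
    for (int i = 0; i < copies; i++) {
        int off = (int)(((long)i * block_size) % PAGE);

        if (!seen[off]) {
            seen[off] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

distinct_offsets(47, 4096) comes out as 4096, while an even block size
would miss offsets: a 48-byte block only ever lands on multiples of 16.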

Note that here the label "2:" is used as the "address".

To separate the overhead from the per-iteration cost I used two values in
%esi (the loop count), one of them 10000, that differ by a factor of 10.
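
Given two runs with different loop counts, the per-iteration cost and the
overhead fall out of a two-point fit. A minimal sketch in C; the struct and
function names are hypothetical, and the sample figures in the note below
are constructed to match the "5*loop_count + overhead" row, not measured
data.

```c
#include <stdint.h>

struct loop_fit {
    double per_iter;    /* cycles per loop iteration */
    double overhead;    /* fixed cycles outside the loop */
};

/* Fit cycles = per_iter * loop_count + overhead through two
 * (loop_count, cycles) measurements. Requires n1 != n2. */
static struct loop_fit fit_two_points(uint32_t n1, uint64_t t1,
                                      uint32_t n2, uint64_t t2)
{
    struct loop_fit fit;

    fit.per_iter = ((double)t2 - (double)t1) / ((double)n2 - (double)n1);
    fit.overhead = (double)t1 - fit.per_iter * (double)n1;
    return fit;
}
```

For example, readings of 5016 cycles at a count of 1000 and 50016 cycles at
a count of 10000 would give 5 cycles per iteration and 16 cycles overhead.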

---------------------------
C code to generate the assembly version of the test. (gzip'ed/uuencoded)

I'd like to hear how other processors fare with this code... !WARNING! If
you run the code below it will turn off interrupts for a relatively long
period of time (in terms of processor speed). This may cause problems on
your system. It causes rtc interrupts to be lost at 64Hz on my system.

Compile the code, run it with the output redirected to <name>.s, and then
compile that assembly file. If you give the program an argument, it will
generate the nop code (it doesn't look at the argument... it's a hack).

It will of course only work for processors that have the rdtsc instruction.

To change the loop count (currently at 10000) you will have to go and edit
the assembly code... look for a $10000. There should only be one.
-gordo

begin 644 gendelay.c.gz
M'XL("!]7&30"`V=E;F1E;&%Y+F,`W5G=;]LV$'^V_XJ#$P%VHKB2+,NRTQ0K
M@BTO0=J7[&4I`GW:[&3)D.14V=#_?:2^2%&4K!0;L"U`"_N.]^-]\>Y(O[N`
MV^CP&J/M+H7I[0S4]7HEPUT4NU$(GP+TXL5P\6[\[@)^B3TO>`47)6F,[&-J
MV8$'Q]#%"]*=!W</C_#Y:`?(@7OD>*'C`99-$(;1Y@I$,>SP)@7:&0J=X.AZ
M\#Y)713-=Q_&8V=G8=XACH)H>_1^^P(WXS_'@/\FH[F/`F_T-/$MO-P+K->Y
M\S29R!6WW`8O4-2YHE+6UG&T9R?:'["X.]]4Y'GB.2D1F,>1:Z563;^_5384
MEI@9;C%J=$PA\L&.L+&0>&GR%-(]L)`J$I)<1@!\\I$3TT1B3ZGD;L!ZV<(-
M2+X,#A9,R6=>>B&6)O+%OZF4S3`$3"5_QLJ.YE:`MB'H#)C.@`51N`4E6RG6
MPL7_R4JV\#V??.M#6+81])7EJ?K*SA%\A7RK!5(O2[OAMD%D!^!&!VP6LRQ]
M/7BCBBS_Y!_#/(S5@I)!%3D<DUT`DF<?:M(^>B&4Y"`WR`G.6SA?:JNU)A-N
M"\)%+5+2)ME9<Z-S1;["H+H^);O-Q$Q%Q%2U@HJU<2FHLS_DRF1BV*];#^;W
MM1M'@6<%<*6OU25%<S+>%TZ)9G)HA33F+(Q:FK$YEV8%Y8:76!,-L\?^E=;'
M-#AF._.,C4`EI5:)#XDI\BLN+VX`Q)W8M41&-NF6?I)BKQ,>FK6W6@E=:[D8
M[UQ5Y`H2.UD7F[@6VL_&MVE%F0-V)I8N<L!LHQE"R\L<*#-JR:/YQ"\BTTOE
M3?-D?%1%$"#3/)V0)I_>/J[DN38)RN483IBDR3>0+"8"H6O#N;&6)6M'G1,2
MYZPY1**+D^79_79('`Y,5)O;D%TTD_I@+7*!,+I%XFBF('/KF+><3IHI#,ZE
MSGK2SJ4`2/P:)P&D))TJ,VJ;N6G"G:O:JJ,\!<0OJLZ5S7/2=BF&%020UW&_
MM]3P:<.E>*L27ZERSLVXU"D<KC>K_M?]`6NZ[,EJC3%;X!:JTUKKU(DOG@)V
M;^$F9:I*73>;]9XR!CS/%JTN=HVD)\==-40LD<:,07[NQTH;.^/+IQCTM)]5
M?3.T,`[UJ6'^@$]SSY!!B5+VQZ"G+1C"/E.UDKR"B?"7_?@%@-D9H&(YYE@M
MRZQ,[HFLWB^I*9V2N/:)DXDI99T-I]7?\^3LJ7U\4<'9L^&[KM:<1&JJ<!0I
MM12/(G5A/3FD.:VB6HU5BJBH&E3_X0DNF#)I!55/5]"N@;-P@=DL@<TQHVU%
M/F:HJ]YIL7O.,(2SYL!)U>"5R:*XF-&;@1=4$U$C7AK"@D9+(>(K&G+12U5"
M4;O<=<F)=N1=1$%Q=6B>QBK\#*FXMG"1HT5&.`@H(@=7V(XHM;3!J=6:(,K3
MI30UY`_/6Z=;533>*IV7E.H6TG<+&S#(:HO-4'>*M!*VHKX)W\GHQ,*VBF*(
MU;1V`4_0Z92IVL20M=7T+YP#EAT=CJ0DS7]6\1_(UIY[0O>L1\<\?.<:YI"\
MR34,P<,^5G:A-JGU<P"_4%?6RY-;T2.UZ#U2Q:E9-)\B2,B%K2[I:G5Z3ZM[
MZQUB2*L3WQ^T!5-_S8UP/$!#.^[0:[&XZZIK1I75IM6[3(5"LV]`T8%ONP6%
M?0$J*/P#"$$3@U%*[*54+=]K/":B/Y@'KYQ[U7H5$S^>[2T4\B]GA-9^-B/4
M'W\S6Y"QZ6]Z,RO/!W<X$/88OX2;=O9!Y/R./_"3J8+_FB>D"/6B<0L0%H.K
M16L>?'B\OQ]_OZY>JH,H.L!-(7CV\?/GIW#"S<4@D;-!Z4Z`ZB_J9C2W7U,/
ME$SQR0NI6K/.'CX]M_&DHD81A[4YY#Y!^!2#!=`V(]=S2@@J_#5,0+/I]RY]
MRDBW[2$`*@.0I-2^41@=AEK$8N9[244%;W(J&7(W<Z=2'AFQO64`-%OF3'Z#
M'AV[-:+_'/[',X"-T3^;#O^K##AS/1_A*>S7C_>//ZO3$`\YHVE>5'2XA/S[
MA3EK+M/892:[K/KA*\(WA]8O7V6]P^.%P77"1>/MN7]&R8LDWSCJ4FC^"[J>
MUNIZ>;/*65?-7H9<+TQ'3Y.[V]L-3.\>'F>@S5=SK?Z%JZ[3V-HQD9WB#V#%
M6T>&PM<7^,O+K'0R/<])&E_G)+(>80#RN63`3?ZIY/LP)7CP`=39>-1<0JI"
ML<R/8I@B3%:NH?Y=$WVY!G1YB<6*J6\ZD1*<5C*[8M:61_`>2!9PPN76<I6+
C>*JO\@W-!#`TR[KU8):4"#A8QSC$\N/OX[\`#?#(%#$>```>
`
end