Now this is REALLY weird!

Peter Dalgaard BSA (pd@kubism.ku.dk)
Sat, 14 Sep 1996 15:21:17 +0200


(This is actually quite similar to the bug Chris Adams got with
kswapd, so I'm not at all sure he was right in attributing it to bad
memory)

I've been getting failures of klogd for quite a while now:

general protection: 0000
CPU: 0
EIP: 0010:[<001119cc>]
EFLAGS: 00010046
eax: 0000000a ebx: 00001000 ecx: 000006fe edx: 0800e00a
esi: 00000000 edi: 08004ede ebp: 00000000 esp: 00d65f74
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process klogd (pid: 62, process nr: 12, stackpage=00d65000)
Stack: 00012f80 00000000 00001000 00d381f0 0015e6bb 00000002 080047e0 00001000
0011f728 00d381f0 00012f80 080047e0 00001000 00dc1c0c 40001fb0 08000f68
bffffe1c 0010a482 00000000 080047e0 00001000 40001fb0 08000f68 bffffe1c
Call Trace: [<0015e6bb>] [<0011f728>] [<0010a482>]
Code: ff 89 c8 5b 5e 5f 5d c3 bd 01 00 00 00 85 ff 0f 84 47 ff ff

>>EIP: 1119cc <sys_syslog+118/20c>
Trace: 15e6bb <kmsg_read+f/14>
Trace: 11f728 <sys_read+88/b0>
Trace: 10a482 <system_call+52/80>

Code: 1119cc <sys_syslog+118/20c> decl 0x5f5e5bc8(%ecx)
Code: 1119d2 <sys_syslog+11e/20c> popl %ebp
Code: 1119d3 <sys_syslog+11f/20c> ret
Code: 1119d4 <sys_syslog+120/20c> movl $0x1,%ebp
Code: 1119d9 <sys_syslog+125/20c> testl %edi,%edi
Code: 1119db <sys_syslog+127/20c> je 00ffff5c <_EIP+ffff5c>
Code: 1119e1 <sys_syslog+12d/20c> nop
Code: 1119e2 <sys_syslog+12e/20c> nop

That's somewhere in printk.o, but I couldn't find anything like it in
printk.s. Closest thing was:

.L1137:
#APP
sti
#NO_APP
.L1182:
movl %ecx,%eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.align 4
.L1149:
movl $1,%ebp
.L1150:
testl %edi,%edi
je .L1128

That kinda makes sense. MOVL is c8 and all the POPs are 5n. Suspecting
gcc or gas bug now...

So I tried to remove printk.o and remake. To cut a long story short,

[root@bush linux]# od -x kernel/printk.o | grep 5bc8
0000500 4147 83fa 003d 0000 0000 c875 89ff 5bc8

There's the offending bit. On my other machine i have 89fb instead of
89ff there. Makes sense: fb is STI, ff is first byte of DEC instr.

"Strange, I thought I checked that earlier and it really did say 89fb
then."

I then logged out as root and tended to some other things.

Back in. I've got to file some kind of report on this, so I redo the
grep:

[root@bush kernel]# od -x printk.o | grep 5bc8
0000500 4147 83fa 003d 0000 0000 c875 89fb 5bc8

Huh? The ff byte flipped to fb all by itself? Note that the fingers at
no point have left the hand...

There can be only one explanation: The buffer cache has been corrupted
immediately AFTER printk.o was written to disk.

This is perfectly reproducible. Happens every time I build a kernel
while running one of the later kernels, even if that kernel itself was
built under 1.2.13. (Only done that a few times, but I plan to test
this more exhaustively some time next week) I've run an extended
memory check on the system, and it finds nothing within the first hour
or so at least, so I guess that flaky memory can be ruled out?

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907