Re: report on a crash of linux-2.2.7.SuSE

Andries.Brouwer@cwi.nl
Sun, 11 Jul 1999 00:42:02 +0200 (MET DST)


From: Kurt Garloff <garloff@suse.de>

On Fri, Jul 09, 1999 at 11:12:23PM +0200, Andries.Brouwer@cwi.nl wrote:
> An hour ago or so, linux-2.2.7.SuSE (as distributed
> with SuSE 6.1, recompiled to eliminate modules)
> died with
> "Aiee, killing interrupt handler"
>
> The stack trace shows that it had gotten into a loop,
> doing die_if_kernel() - die() - do_exit() - schedule() -
> error_code - do_page_fault() - die_if_kernel() - etc
> until stack overflow.

My guess would be either (1) miscompilation or (2) hardware problems.
Ad (1): Don't use gcc-2.95 without -fno-strict-aliasing.
Don't forget make oldconfig; make dep; make clean
Ad (2): Does this kernel run on another machine?
Does another kernel run on this machine?
Memory/CPU testing: Memory testers: memtest86.
100 kernel compilations
targzippping kernel source tree, untarring and
comparing several times.
Just some ideas.

Well, of course I sent it to the list because I think it is
a kernel bug. It smells like one.

Hardware? Three days old. One crash. Very heavy use at the
moment of crash. Two ethernet cards were transferring Gigabytes
from one machine to another. Gcc was active. Netscape was active.
I started yast and the machine crashed the same moment.

Bad memory? Have not had time for memtest86. Have read two 17 GB disks
several times with dd if=/dev/disk | md5sum and find consistent answers.
Have simultaneously tarred and untarred some trees and compiled some
kernels - no problems. There are no typical signs of flaky hardware.
(This is a Pentium II, 256 MB.)

Miscompilation? Well as I said, this was a fresh SuSE 6.1 install,
and the gcc that comes with it says

# cc -v
Reading specs from /usr/lib/gcc-lib/i486-linux/egcs-2.91.66/specs
gcc version egcs-2.91.66 19990314 (egcs-1.1.2 release)

Miscompilation is a definite possibility. This version of cc
is certainly somewhat broken - it sometimes produces strange warnings.
I don't know whether it produces bad code.

Let me show you an example of egcs-2.91.66 behaviour:
----------------------------- foo.c ------------------------------------

void panic(const char * fmt, ...)
__attribute__ ((noreturn, format (printf, 1, 2)));

void printk(const char * fmt, ...)
__attribute__ ((format (printf, 1, 2)));

struct SC {
void (*fn)(int);
};

extern int get_int(void);
extern struct SC * get_SC(void);

void
foo() {
int mbo;
struct SC * p = get_SC();

if(!p) panic("Ouch!");

while (1){

mbo = get_int();

if (mbo < 0) {
printk("%d\n", mbo);
return;
}

if (p->fn)
(p->fn)(0);
else
printk("fn NULL?\n");
}
}
-----------------------------------------------------

# cc -Wall -O2 -c foo.c
foo.c: In function `foo':
foo.c:18: warning: `mbo' might be used uninitialized in this function
#

Strange.. - mbo is used after an assignment to it.
How can it be uninitialized?
And the strange part is, this is a minimal example:
every simplification of this example makes the complaint go away.
(Changing fn to be void (*fn)(void); or deleting the
`else printk("fn NULL?\n");' part, or deleting the `return;',
or changing the panic into a printk.
Also deleting the -O2 flag makes the complaint go away.)

So, egcs-2.91.66 with -O2 is confused about the flow of control.
With some phantasy I can imagine that, once confused, gcc manages
to optimize away some code that is necessary, or miscompile in
some other way.

I'll cc this to linux-gcc and egcs-bugs@egcs.cygnus.com.

Andries

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/