Critical bug in SATA driver?

From: Janos Haar
Date: Mon Jun 05 2006 - 06:26:51 EST


Hello, list,

I have a reproducible _software_ bug involving SATA (libata?).

My system has 4 nodes, each with 12 HDDs.
All the hardware is identical, thoroughly tested, and really error free!

But 2 of my nodes frequently reboot without any error message on the
remote syslog or netconsole.
No oops, nothing!
(loglevel 9)

Only the serial console sometimes shows a small clue:

"ata2: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error"

After this short, partial message, the system immediately reboots!

That is all.
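
For anyone reading the partial message above, the status and error bytes
can be decoded by hand. The following is a minimal sketch in Python, not
anything taken from libata itself: the helper names parse_line and
decode_bits are made up for illustration, the three status names that
appear in the log are copied from it, and the remaining bit names are the
usual ATA register mnemonics as I understand them.

```python
import re

# ATA status register bits, highest bit first.
# "DriveReady", "SeekComplete" and "Error" are the names seen in the log;
# the other names are assumed labels for the remaining bits.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# ATA error register bits, using the conventional ATA mnemonics.
ERROR_BITS = {
    0x80: "ICRC/BBK",  # interface CRC error (bad block on old drives)
    0x40: "UNC",       # uncorrectable data error
    0x20: "MC",        # media changed
    0x10: "IDNF",      # sector ID not found
    0x08: "MCR",       # media change requested
    0x04: "ABRT",      # command aborted
    0x02: "TK0NF",     # track 0 not found
    0x01: "AMNF",      # address mark not found
}

# Matches the "translated ATA stat/err ..." line printed on the console.
LINE_RE = re.compile(
    r"translated ATA stat/err 0x([0-9a-fA-F]+)/([0-9a-fA-F]+)"
    r" to SCSI SK/ASC/ASCQ 0x([0-9a-fA-F]+)/([0-9a-fA-F]+)/([0-9a-fA-F]+)"
)


def parse_line(line):
    """Return the five hex fields of the log line as integers, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    keys = ("status", "error", "sk", "asc", "ascq")
    return {k: int(v, 16) for k, v in zip(keys, m.groups())}


def decode_bits(value, table):
    """List the names of all bits set in value, highest bit first."""
    return [name for bit, name in table.items() if value & bit]


if __name__ == "__main__":
    line = "ata2: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04"
    fields = parse_line(line)
    print("status:", decode_bits(fields["status"], STATUS_BITS))
    print("error: ", decode_bits(fields["error"], ERROR_BITS))
    print("sense:  SK=%#x ASC=%#x ASCQ=%#x"
          % (fields["sk"], fields["asc"], fields["ascq"]))
```

For the line above, status 0x51 decodes to DriveReady, SeekComplete and
Error, and error 0x40 decodes to UNC (uncorrectable data error). On the
SCSI side, sense key 0x3 with ASC/ASCQ 0x11/0x04 is the standard code for
a medium error, "unrecovered read error - auto reallocate failed". This
only describes what the drive reported; it does not by itself explain the
reboot.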

The SMART log is clean on _all_ 48 of my HDDs.
I have also replaced the disk and the cable on the ata2 port, but this did
not help.

The kernels are 2.6.16.1 and 2.6.17-rc3-git1. (I am now trying the latest
rc and git.)
(I cannot try kernels older than 2.6.16, because my SATA cards are
unsupported in older versions.)

The dmesg log from my node #1 is here:
http://download.netcenter.hu/bughunt/20060605/dmesg.txt

I have compiled the kernel with these debug options:
[*] Magic SysRq key
[*] Kernel debugging
[*] Collect scheduler statistics
[*] Mutex debugging, deadlock detection
[*] Force gcc to inline functions marked 'inline'
[*] Check for stack overflows
(2) Stack backtraces per line
[*] Debug page memory allocations
[*] Write protect kernel read-only data structures

Can somebody find and fix it?

I can do almost anything to help debug this!


Thanks,
Janos
