RE: Serverworks dma_intr: error=0x40 { UncorrectableError }

From: Manish Lachwani (manish@Zambeel.com)
Date: Tue Nov 19 2002 - 16:59:38 EST


0x40 indicate Uncorrectable ECC errors. CHeck the SMART data from the drive
using utility called smartctl. You should probably see a Current Pending
Count. As long as the pending sector does not reallocated, you will continue
to see this problem when reading that specific sector. WHat you can do is
write to that sector and get the sector remapped. Once remapped, these
problems on that sector wont occur.

ALso, we had been using TYAN S2518 with OSB4 in UDMA 2. However, it causes
data corruption when there is IO with even one drive. We used two drives in
master-master and master-slave mode and it did not solve the problem. EVen
at UDMA 0, we experienced the same issue of data corruption. Once does not
need to have 0x40 errors to see this corruption. If you want to see this
corruption, use the dt utility and then run:

./dt of=/dev/hdc bs=300M incr=512 enable=raw iodir=forward pattern=iot
log=tst_hdc.log dispose=keep iodir=reverse iotype=sequential pattern=iot
enable=compare,coredump,debug,raw,verify,verbose,pstats &

on hda and hdc. Also, try to have network traffic in parallel. This will
result in data getting shifted by 4 bytes ...

-----Original Message-----
From: Steven Timm [mailto:timm@fnal.gov]
Sent: Tuesday, November 19, 2002 1:47 PM
To: linux-kernel@vger.kernel.org
Subject: Serverworks dma_intr: error=0x40 { UncorrectableError }

The problems with the Serverworks OSB4 chipset are well documented
on this list. It is known that changing to multi-word DMA mode
will severely reduce problems. However, since we need high
throughput in our application, we are running at high DMA speeds still.
The configuration in question is:

2.4.9-31smp kernel, Tyan 2518 motherboard, Western Digital 20 Gb system
disk hda, and hdc and hdd being each a 40Gb Western Digital drive.

The error usually happens on the system disk. In that location
of the disk, there are usually some corrupted files as a result.
But when we either reinstall the system disk, formatting checking
for bad blocks, or low-level format the disk and reinstall, the
system is able to go on for quite some length of time
without these errors repeating.

My question--is there any way that the Serverworks OSB4 could be
causing a soft error on the disk such that the disk appears to
be bad (sometimes to the point that SMART puts up an alert in
the BIOS saying it is bad) but yet can still be used effectively?

(And yes, I am aware that the 2.4.18 kernels trap this condition and
usually stop the file corruption before it happens.)

Nov 7 05:08:10 fnd0102 kernel: Curious - OSB4 thinks the DMA is still
running.
Nov 7 05:08:10 fnd0102 kernel: OSB4 wait exit.
Nov 7 05:08:10 fnd0102 kernel: hda: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Nov 7 05:08:10 fnd0102 kernel: hda: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=4457382, sector=4457312
Nov 7 05:08:10 fnd0102 kernel: end_request: I/O error, dev 03:01 (hda),
sector
4457312
Nov 7 05:08:10 fnd0102 kernel: EXT2-fs error (device ide0(3,1)):

Steve Timm

------------------------------------------------------------------
Steven C. Timm (630) 840-8525 timm@fnal.gov http://home.fnal.gov/~timm/
Fermilab Computing Division/Operating Systems Support
Scientific Computing Support Group--Computing Farms Operations

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sat Nov 23 2002 - 22:00:29 EST