Re: MD/RAID time out writing superblock

From: Mark Lord
Date: Mon Sep 21 2009 - 15:47:59 EST


Chris Webb wrote:
> Chris Webb <chris@xxxxxxxxxxxx> writes:
>
>> Mark Lord <liml@xxxxxx> writes:
>>
>>> Speaking of which..
>>>
>>> Chris: I wonder if the errors will also vanish in your situation
>>> by disabling the onboard write-caches in the drives ?
>>>
>>> Eg. hdparm -W0 /dev/sd?
>>
>> Hi Mark. I've got a test machine on its way at the moment, so I'll make sure
>> I check this one out on it too.
>
> Our test machine is still being built, but we had an opportunity to try this on
> a couple of the live machines when their RAID arrays failed over the weekend.
> We still got timeouts, but (predictably!) they're not on flushes any more:
>
> ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> ata2.00: cmd 35/00:08:98:c6:00/00:00:4e:00:00/e0 tag 0 dm
> ...
> all the way through the night.
>
> I also have these in the log, but they are immediately after turning off the
> write caching in all drives, so may be a red herring with data still being
> written out.
>
> ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> ata2.00: cmd c8/00:08:00:20:80/00:00:00:00:00/e0 tag 0 dm
> ...
> On another machine, I saw this with write caching turned off:
>
> ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
> ata2.00: cmd 61/08:00:28:1f:80/00:00:00:00:00/40 tag 0 ncq 4096 out
> ...

0x35 is a 48-bit DMA WRITE, 0xc8 is a 28-bit DMA READ,
and 0x61 is an NCQ WRITE.
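
For anyone decoding these by hand: the first byte after "cmd" in those
log lines is the ATA command opcode. Below is a minimal sketch in Python
that pulls it out and names it, assuming the log format shown in the
quoted lines above. The table covers only a handful of common opcodes,
and decode_opcode is a hypothetical helper name used for illustration,
not part of any existing tool.

```python
import re

# Small, deliberately incomplete subset of ATA command opcodes.
ATA_OPCODES = {
    0x25: "READ DMA EXT (48-bit)",
    0x35: "WRITE DMA EXT (48-bit)",
    0xC8: "READ DMA (28-bit)",
    0xCA: "WRITE DMA (28-bit)",
    0x60: "READ FPDMA QUEUED (NCQ)",
    0x61: "WRITE FPDMA QUEUED (NCQ)",
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
}

# Matches the two hex digits that follow "cmd " and precede the first "/".
CMD_RE = re.compile(r"cmd ([0-9a-fA-F]{2})/")


def decode_opcode(log_line):
    """Return a name for the ATA opcode in a 'cmd xx/...' log line.

    Returns None if the line contains no 'cmd xx/' field.
    """
    match = CMD_RE.search(log_line)
    if match is None:
        return None
    opcode = int(match.group(1), 16)
    # Fall back to a generic label for opcodes missing from the table.
    return ATA_OPCODES.get(opcode, "unknown opcode 0x%02x" % opcode)
```

Feeding it the three "cmd" lines quoted above gives the 48-bit DMA write,
the 28-bit DMA read and the NCQ write respectively; the "exception" lines
carry no "cmd" field, so they return None.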

Looks like some kind of hardware trouble to me.
And as Tejun suggested, it's difficult to guess at
a cause other than the PSU.

Cheers, and good luck.