Re: sata error on ICH8M

From: Tejun Heo
Date: Tue Nov 11 2008 - 09:01:33 EST


Hello,

Tim Connors wrote:
> I'm running a debian 2.6.26-9 kernel (sid) on a new laptop with:

Hmmm...

> 00:1f.2 SATA controller: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA AHCI Controller (rev 03) (prog-if 01 [AHCI 1.0])
> Subsystem: Dell Device 0275
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin B routed to IRQ 378
> Region 0: I/O ports at 1c00 [size=8]
> Region 1: I/O ports at 18d4 [size=4]
> Region 2: I/O ports at 18d8 [size=8]
> Region 3: I/O ports at 18d0 [size=4]
> Region 4: I/O ports at 18e0 [size=32]
> Region 5: Memory at f8504000 (32-bit, non-prefetchable) [size=2K]
> Capabilities: [80] Message Signalled Interrupts: Mask- 64bit- Queue=0/2 Enable+
> Address: fee0300c Data: 41a1
> Capabilities: [70] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
> Status: D0 PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [a8] SATA HBA <?>
> Kernel driver in use: ahci
> Kernel modules: ahci
>
> Soon after booting, and possibly after the disk had spun down and was
> asked to spin back up (which it had done successfully a few times so
> far, with a spinddown timeout of 10 minutes, and using laptop_mode), it
> had a sata error:
>
> Nov 4 01:49:29 gamow kernel: [ 1865.106289] ata1.00: exception Emask 0x10 SAct 0x3ff SErr 0x50000 action 0xe frozen
> Nov 4 01:50:29 gamow kernel: [ 1865.106307] ata1: SError: { PHYRdyChg CommWake }
> Nov 4 01:50:29 gamow kernel: [ 1865.106319] ata1.00: cmd 61/08:00:bd:c9:57/00:00:00:00:00/40 tag 0 ncq 4096 out
> Nov 4 01:50:29 gamow kernel: [ 1865.106322] res 40/00:00:02:4f:c2/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
> Nov 4 01:50:29 gamow kernel: [ 1865.106329] ata1.00: status: { DRDY }
> Nov 4 01:50:29 gamow kernel: [ 1865.106339] ata1.00: cmd 61/08:08:35:7a:73/00:00:00:00:00/40 tag 1 ncq 4096 out
> Nov 4 01:50:29 gamow kernel: [ 1865.106343] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
> Nov 4 01:50:29 gamow kernel: [ 1865.106350] ata1.00: status: { DRDY }
> Nov 4 01:50:29 gamow kernel: [ 1865.106361] ata1.00: cmd 61/08:10:35:36:9b/00:00:00:00:00/40 tag 2 ncq 4096 out
> Nov 4 01:50:29 gamow kernel: [ 1865.106364] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
> Nov 4 01:50:29 gamow kernel: [ 1865.106371] ata1.00: status: { DRDY }
> Nov 4 01:50:29 gamow kernel: [ 1865.106382] ata1.00: cmd 61/28:18:55:94:e2/00:00:00:00:00/40 tag 3 ncq 20480 out
>
> And then it fails to reset after some time, remounting the devices
> readonly (although what wasn't already in the cache became unreadable with
> lots of IO errors rapidly filling up dmesg). None of the logs made it to
> disk, naturally enough, and this was what I caught in syslog before syslog
> bailed. There were interesting messages that happened after this, but the
> dmesg buffer filled up before I thought about saving them.

Well, the disk is already a goner at that point so you need to either
set up a netconsole or plug in a usb stick, mount it and do "while
true; do dmesg -c >> /mnt/usbstick/dmesg.out; sleep 1; done" to
capture the kernel log.

> My /sys/class/scsi_host/host0/link_power_management_policy is:
> min_power
> . powertop had earlier (in a previous warm-boot) prompted me
> to set link_power_management_policy, so I had been tweaking that, but I
> didn't look at its default setting - I presume it was already at
> min_power, as it is now from a fresh (cold) bootup.
>
> Just in case it is related, in each bootup, I earlier get
> Nov 4 01:20:26 gamow kernel: [ 116.217492] CE: hpet increasing min_delta_ns to 15000 nsec
> Nov 4 01:20:28 gamow kernel: [ 118.073078] CE: hpet increasing min_delta_ns to 22500 nsec
> Nov 4 01:20:30 gamow kernel: [ 120.467250] ACPI: EC: missing confirmations, switch off interrupt mode.
> Nov 4 01:20:30 gamow kernel: [ 120.603593] CE: hpet increasing min_delta_ns to 33750 nsec
> Nov 4 01:20:31 gamow kernel: [ 121.004372] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20080321]
> Nov 4 01:20:31 gamow kernel: [ 121.004372] ACPI Error (psparse-0530): Method parse/execution failed [\_SB_.PCI0.LPCB.BAT1._BST] (Node ffff81013fa6cb90), AE_TIME
> Nov 4 01:20:31 gamow kernel: [ 121.004372] ACPI Exception (battery-0360): AE_TIME, Evaluating _BST [20080321]
>
> and/or
>
> [ 1587.842640] CE: hpet increasing min_delta_ns to 15000 nsec
> (successively increasing as time goes on)
>
> The sata link went belly up perhaps a few minutes after I went to bed
> lastnight, and the only thing I can think of that I did before then was to
> unplug and replug the ethernet and/or the wireless. The ACPI messages
> seem to happen around networking events on this laptop, but I haven't had
> a chance to investigate further.

I really need to see how the recovery attempt failed. Can you please
reproduce the problem and report the kernel log? Also, please report
how reproducible the problem is.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/