Re: 2.6.29 regression: ATA bus errors on resume

From: Niel Lambrechts
Date: Fri Mar 27 2009 - 15:11:18 EST


On 03/26/2009 12:20 AM, James Bottomley wrote:
> On Wed, 2009-03-25 at 02:06 -0400, Jeff Garzik wrote:
>> Niel Lambrechts wrote:
>>> On 03/25/2009 03:30 AM, Theodore Tso wrote:
>>>> On Tue, Mar 24, 2009 at 10:25:57PM +0200, Niel Lambrechts wrote:
>>>>> Hi,
>>>>>
>>>>> After upgrading to 2.6.29 I get the below errors after resuming from
>>>>> hibernating with s2disk. I ran fsck and tried doing the same thing again
>>>>> in 2.6.28.9-pae, but do not get any errors there.
>>>> The ext4 errors are interleaved with hardware errors, and the ext4
>>>> errors are about I/O errors.
>>>>
>>>> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
>>>> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
>>>>
>>>> This looks more like a hibernation problem than an ext4 problem.
>>>> Looks like the hard drive is being left in some inconsistent state
>>>> after resuming from hibernation.
>>>>
>>>> - Ted
>>> Thanks for the info Theodore, this is definitely looks like some type of
>>> regression in 2.6.29, as the problem is not evident when I s2disk using
>>> 2.6.28.9, even after multiple suspend/resume cycles.
>>>
>>> I found some 'ATA bus errors' and 'SError' messages in
>>> /var/log/messages, so I've attached the messages from both 2.6.29 and
>>> 2.6.28 for comparison.
>> Well, here is the interpretation of messages:
>>
>>> ata1.00: irq_stat 0x00400008, PHY RDY changed
>>> ata1: SError: { PHYRdyChg CommWake }
>> Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
>> which requires us to abort a bunch of queued commands:
>>
>>> ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
>>> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
>> [...]
>>> ata1.00: cmd 60/30:68:07:b3:10/00:00:0c:00:00/40 tag 13 ncq 24576 in
>>> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
>> ...through the 14th command (tag 13).
>>
>>> Mar 24 21:29:14 linux-7vph kernel: ata1: hard resetting link
>>> Mar 24 21:29:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
>>
>> SATA link is reset, and ACPI is re-run.
>>
>>> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
>>> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
>>> Mar 24 21:29:14 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
>>> Mar 24 21:29:14 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
>>> Mar 24 21:29:14 linux-7vph kernel: 0c 10 b3 07
>>> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
>>> Mar 24 21:29:14 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190455
>> The SCSI subsystem aborts each of the queued commands.
>
> No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
> sense data for each of the outstanding I/Os
>
> The only place these are generated is in ata_sense_to_error() which only
> occurs if there's some type of ata error.
>
> If I had to theorise, I'd say the system suspended with commands
> outstanding to the device. On resume, the device gets reset and returns
> some type of ATA error which gets translated to ABORTED COMMAND which
> causes a failure.
>
> In the mid layer, we translate ABORTED_COMMAND into a retry until the
> command runs out of them ... could it be there's a race readying the
> device and we run through the retries before it can accept the command?

I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
although with all the 2.6.28 variants I used s2disk/resume has always
worked. I was wondering if these "errors" perhaps play more of a role in
2.6.29, perhaps due to the async. changes that was mentioned?


In particular, I notice some the following before "EH complete":
ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
ata1: irq_stat 0x00400040, connection status changed

but before this last message there is a
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

Why would the connection status change if a moment ago it was "link up"?

Niel
Mar 27 20:18:01 linux-7vph -- MARK --
Mar 27 20:18:01 linux-7vph syslog-ng[32661]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=350', processed='center(received)=284', processed='destination(newsnotice)=0', processed='destination(acpid)=8', processed='destination(firewall)=0', processed='destination(null)=61', processed='destination(mail)=0', processed='destination(mailinfo)=0', processed='destination(console)=20', processed='destination(newserr)=0', processed='destination(newscrit)=0', processed='destination(messages)=129', processed='destination(mailwarn)=0', processed='destination(localmessages)=0', processed='destination(netmgm)=86', processed='destination(mailerr)=0', processed='destination(xconsole)=20', processed='destination(warn)=26', processed='source(src)=284'
Mar 27 20:18:00 linux-7vph kernel: Syncing filesystems ... done.
Mar 27 20:18:01 linux-7vph kernel: Freezing user space processes ... (elapsed 0.00 seconds) done.
Mar 27 20:18:01 linux-7vph kernel: Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
Mar 27 20:18:01 linux-7vph kernel: PM: Shrinking memory... done (57185 pages freed)
Mar 27 20:18:01 linux-7vph kernel: PM: Freed 228740 kbytes in 2.03 seconds (112.67 MB/s)
Mar 27 20:18:01 linux-7vph kernel: Suspending console(s) (use no_console_suspend to debug)
Mar 27 20:18:01 linux-7vph kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Mar 27 20:18:01 linux-7vph kernel: ACPI handle has no context!
Mar 27 20:18:01 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D disabled
Mar 27 20:18:01 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C disabled
Mar 27 20:18:01 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B disabled
Mar 27 20:18:01 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A disabled
Mar 27 20:18:01 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B disabled
Mar 27 20:18:01 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D disabled
Mar 27 20:18:01 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C disabled
Mar 27 20:18:01 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B disabled
Mar 27 20:18:01 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A disabled
Mar 27 20:18:01 linux-7vph kernel: e1000e 0000:00:19.0: PME# enabled
Mar 27 20:18:01 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability enabled by ACPI
Mar 27 20:18:01 linux-7vph kernel: e1000e 0000:00:19.0: PME# enabled
Mar 27 20:18:01 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability enabled by ACPI
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A disabled
Mar 27 20:18:06 linux-7vph kernel: ACPI: Preparing to enter system sleep state S4
Mar 27 20:18:06 linux-7vph kernel: Disabling non-boot CPUs ...
Mar 27 20:18:06 linux-7vph kernel: CPU 1 is now offline
Mar 27 20:18:06 linux-7vph kernel: SMP alternatives: switching to UP code
Mar 27 20:18:06 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Mar 27 20:18:06 linux-7vph kernel: CPU1 attaching NULL sched-domain.
Mar 27 20:18:06 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Mar 27 20:18:06 linux-7vph kernel: CPU1 is down
Mar 27 20:18:06 linux-7vph kernel: Extended CMOS year: 2000
Mar 27 20:18:06 linux-7vph kernel: PM: Creating hibernation image:
Mar 27 20:18:06 linux-7vph kernel: PM: Need to copy 125532 pages
Mar 27 20:18:06 linux-7vph kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
Mar 27 20:18:06 linux-7vph kernel: Intel machine check architecture supported.
Mar 27 20:18:06 linux-7vph kernel: Intel machine check reporting enabled on CPU#0.
Mar 27 20:18:06 linux-7vph kernel: Extended CMOS year: 2000
Mar 27 20:18:06 linux-7vph kernel: Enabling non-boot CPUs ...
Mar 27 20:18:06 linux-7vph kernel: SMP alternatives: switching to SMP code
Mar 27 20:18:06 linux-7vph kernel: Booting processor 1 APIC 0x1 ip 0x6000
Mar 27 20:18:06 linux-7vph kernel: Initializing CPU#1
Mar 27 20:18:06 linux-7vph kernel: Calibrating delay using timer specific routine.. 5054.03 BogoMIPS (lpj=10108064)
Mar 27 20:18:06 linux-7vph kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
Mar 27 20:18:06 linux-7vph kernel: CPU: L2 cache: 6144K
Mar 27 20:18:06 linux-7vph kernel: CPU: Physical Processor ID: 0
Mar 27 20:18:06 linux-7vph kernel: CPU: Processor Core ID: 1
Mar 27 20:18:06 linux-7vph kernel: Intel machine check architecture supported.
Mar 27 20:18:06 linux-7vph kernel: Intel machine check reporting enabled on CPU#1.
Mar 27 20:18:06 linux-7vph kernel: x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
Mar 27 20:18:06 linux-7vph kernel: CPU1: Intel(R) Core(TM)2 Duo CPU T9400 @ 2.53GHz stepping 06
Mar 27 20:18:06 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Mar 27 20:18:06 linux-7vph kernel: CPU0 attaching sched-domain:
Mar 27 20:18:06 linux-7vph kernel: domain 0: span 0-1 level MC
Mar 27 20:18:06 linux-7vph kernel: groups: 0 1
Mar 27 20:18:06 linux-7vph kernel: domain 1: span 0-1 level CPU
Mar 27 20:18:06 linux-7vph kernel: groups: 0-1
Mar 27 20:18:06 linux-7vph kernel: CPU1 attaching sched-domain:
Mar 27 20:18:06 linux-7vph kernel: domain 0: span 0-1 level MC
Mar 27 20:18:06 linux-7vph kernel: groups: 1 0
Mar 27 20:18:06 linux-7vph kernel: domain 1: span 0-1 level CPU
Mar 27 20:18:06 linux-7vph kernel: groups: 0-1
Mar 27 20:18:06 linux-7vph kernel: Switched to high resolution mode on CPU 1
Mar 27 20:18:06 linux-7vph kernel: CPU1 is up
Mar 27 20:18:06 linux-7vph kernel: ACPI: Waking up from system sleep state S4
Mar 27 20:18:06 linux-7vph kernel: APIC error on CPU1: 00(40)
Mar 27 20:18:06 linux-7vph kernel: ACPI: EC: non-query interrupt received, switching to interrupt mode
Mar 27 20:18:06 linux-7vph kernel: pci 0000:00:02.0: restoring config space at offset 0x1 (was 0x900007, writing 0x900407)
Mar 27 20:18:06 linux-7vph kernel: pci 0000:00:02.0: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
Mar 27 20:18:06 linux-7vph kernel: e1000e 0000:00:19.0: irq 2298 for MSI/MSI-X
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.0: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B -> GSI 21 (level, low) -> IRQ 21
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.1: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.2: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C -> GSI 22 (level, low) -> IRQ 22
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1a.2: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: ehci_hcd 0000:00:1a.7: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D -> GSI 23 (level, low) -> IRQ 23
Mar 27 20:18:06 linux-7vph kernel: ehci_hcd 0000:00:1a.7: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: HDA Intel 0000:00:1b.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100102)
Mar 27 20:18:06 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Mar 27 20:18:06 linux-7vph kernel: HDA Intel 0000:00:1b.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: pcieport-driver 0000:00:1c.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: pcieport-driver 0000:00:1c.1: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: pcieport-driver 0000:00:1c.3: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: pcieport-driver 0000:00:1c.4: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.0: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.1: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
Mar 27 20:18:06 linux-7vph kernel: uhci_hcd 0000:00:1d.2: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: ehci_hcd 0000:00:1d.7: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D -> GSI 19 (level, low) -> IRQ 19
Mar 27 20:18:06 linux-7vph kernel: ehci_hcd 0000:00:1d.7: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: pci 0000:00:1e.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: ahci 0000:00:1f.2: restoring config space at offset 0x1 (was 0x2b00403, writing 0x2b00407)
Mar 27 20:18:06 linux-7vph kernel: ahci 0000:00:1f.2: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: pci 0000:15:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Mar 27 20:18:06 linux-7vph kernel: ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[17] MMIO=[f4801000-f48017ff] Max Packet=[2048] IR/IT contexts=[4/4]
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Starting disk
Mar 27 20:18:06 linux-7vph kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 27 20:18:06 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Mar 27 20:18:06 linux-7vph kernel: ata1.00: configured for UDMA/133
Mar 27 20:18:06 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
Mar 27 20:18:06 linux-7vph kernel: ata1: irq_stat 0x00400040, connection status changed
Mar 27 20:18:06 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata1.00: configured for UDMA/133
Mar 27 20:18:06 linux-7vph kernel: ata1: EH complete
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors: (200 GB/186 GiB)
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors: (200 GB/186 GiB)
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Mar 27 20:18:06 linux-7vph kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 27 20:18:06 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
Mar 27 20:18:06 linux-7vph kernel: ata2.00: configured for UDMA/133
Mar 27 20:18:06 linux-7vph kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
Mar 27 20:18:06 linux-7vph kernel: ata2: irq_stat 0x40000001
Mar 27 20:18:06 linux-7vph kernel: ata2.00: configured for UDMA/133
Mar 27 20:18:06 linux-7vph kernel: ata2: EH complete
Mar 27 20:18:06 linux-7vph kernel: pci 0000:00:02.0: power state changed by ACPI to D0
Mar 27 20:18:06 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
Mar 27 20:18:06 linux-7vph kernel: Restarting tasks ... done.