Re: Problem with ata layer in 2.6.24

From: Jeff Garzik
Date: Mon Jan 28 2008 - 14:09:03 EST


Gene Heskett wrote:
Greeting;

I had to reboot early this morning due to a freezeup, and I had a bunch of these in the messages log:
==============
Jan 27 19:42:11 coyote kernel: [42461.915961] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 27 19:42:11 coyote kernel: [42461.915973] ata1.00: cmd ca/00:08:b1:66:46/00:00:00:00:00/e8 tag 0 dma 4096 out
Jan 27 19:42:11 coyote kernel: [42461.915974] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 27 19:42:11 coyote kernel: [42461.915978] ata1.00: status: { DRDY }
Jan 27 19:42:11 coyote kernel: [42461.916005] ata1: soft resetting link
Jan 27 19:42:12 coyote kernel: [42462.078216] ata1.00: configured for UDMA/100
Jan 27 19:42:12 coyote kernel: [42462.078232] ata1: EH complete
Jan 27 19:42:12 coyote kernel: [42462.090700] sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)
Jan 27 19:42:12 coyote kernel: [42462.114230] sd 0:0:0:0: [sda] Write Protect is off
Jan 27 19:42:12 coyote kernel: [42462.115079] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
===============
That one showed up about 2 hours ago, so I expect I'll be locked up again before I've managed a 24 hour uptime. This drive passed
a 'smartctl -t long /dev/sda' with flying colors after the reboot
this morning.

Two instances were logged after I had rebooted to 2.6.24 from 2.6.24-rc8:

Jan 24 20:46:33 coyote kernel: [ 0.000000] Linux version 2.6.24 (root@xxxxxxxxxxxxxxxxx) (gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)) #1 SMP Thu Jan 24 20:17:55 EST 2008
----
Jan 27 02:28:29 coyote kernel: [193207.445158] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 27 02:28:29 coyote kernel: [193207.445170] ata1.00: cmd 35/00:08:f9:24:0a/00:00:17:00:00/e0 tag 0 dma 4096 out
Jan 27 02:28:29 coyote kernel: [193207.445172] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 27 02:28:29 coyote kernel: [193207.445175] ata1.00: status: { DRDY }
Jan 27 02:28:29 coyote kernel: [193207.445202] ata1: soft resetting link
Jan 27 02:28:29 coyote kernel: [193207.607384] ata1.00: configured for UDMA/100
Jan 27 02:28:29 coyote kernel: [193207.607399] ata1: EH complete
Jan 27 02:28:29 coyote kernel: [193207.609681] sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)
Jan 27 02:28:29 coyote kernel: [193207.619277] sd 0:0:0:0: [sda] Write Protect is off
Jan 27 02:28:29 coyote kernel: [193207.649041] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 27 02:30:06 coyote kernel: [193304.336929] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 27 02:30:06 coyote kernel: [193304.336940] ata1.00: cmd ca/00:20:69:22:a6/00:00:00:00:00/e7 tag 0 dma 16384 out
Jan 27 02:30:06 coyote kernel: [193304.336942] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 27 02:30:06 coyote kernel: [193304.336945] ata1.00: status: { DRDY }
Jan 27 02:30:06 coyote kernel: [193304.336972] ata1: soft resetting link
Jan 27 02:30:06 coyote kernel: [193304.499210] ata1.00: configured for UDMA/100
Jan 27 02:30:06 coyote kernel: [193304.499226] ata1: EH complete
Jan 27 02:30:06 coyote kernel: [193304.499714] sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)
Jan 27 02:30:06 coyote kernel: [193304.499857] sd 0:0:0:0: [sda] Write Protect is off
Jan 27 02:30:06 coyote kernel: [193304.502315] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

None were logged during the time I was running an -rc7 or -rc8.

The previous hits on this resulted in the udma speed being downgraded till it was actually running in pio just before the freeze that required the hardware reset button.

Unfortunately there are 1001 different causes for timeouts, so we need to drill down into the hardware, libata version, and ACPI version (most notably).


I'll reboot to -rc8 right now and resume. If its the drive, I should see it.
If not, then 2.6.24 is where I'll point the finger.

There was also an ACPI update, which always affects interrupt handling (whose symptom can sometimes be a timeout).

Definitely interesting in test results from what you describe.

Jeff


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/