Re: 2.6.29 regression: ATA bus errors on resume

From: Tejun Heo
Date: Tue May 26 2009 - 00:58:42 EST


Hello, Niel.

Niel Lambrechts wrote:
> I've tested all of the kernels I have again since 2.6.29.4 also came out
> just recently. I did a hibernate/resume for each in the console, then
> repeated the same in X, then continued to the next kernel.
>
> The 2.6.29.4 log is much larger, since some other badness happened there
> - there is a large kernel trace in there as my first X hibernation
> attempt failed and came back to X after a few seconds. The system seemed
> functional, it did not keep generating kernel messages - when I then
> retried a hibernate it worked, along with the resume. Another unrelated
> bug perhaps?
>
> As for "hard resetting link" messages, they seemed to always happen
> under X the times I tried it.
>
> Kernel      EXT4-errors?  Console:ata1 reset?  Console:ata2 reset?  X:ata1 reset?  X:ata2 reset?
> 2.6.28.10   No            no                   yes                  yes            no
> 2.6.29.4*   No            no                   no                   no             no
> 2.6.29.4**  No            -                    -                    yes            no
> 2.6.30-rc6  Yes           -                    -                    yes            no
> 2.6.30-rc6  No            no                   no                   yes            no
>
> * Xorg hibernation attempt failed.
> ** Xorg second hibernation attempt (no extra reboot)
>
> I also did a side by side comparison of the messages I have for
> 2.6.30-rc6, the one with EXT4 errors I reported on yesterday, and
> another one that worked just fine tonight. I stripped all time-stamps
> and some pulseaudio messages from the bad one and attached them here,
> and also saved the full messages for each kernel to
> http://bugzilla.kernel.org/show_bug.cgi?id=13017 .
>
> Since analysing the code-path is still a bit beyond me, I'll leave you
> with a little summary of the differences I notice.
>
> A = 2.6.30-rc6 (EXT4 clean)
> B = 2.6.30-rc6 (EXT4 errors triggered)

Duplicate PHY events are likely to be timing-dependent and
non-deterministic. Whether ext4 ends up corrupted depends on whether a
request with failfast set was in flight at the time of the second PHY
event, which again depends on timing. At any rate, this looks like a
problem in ext4 (or in something between ext4 and the driver). It
either shouldn't issue failfast commands or should take appropriate
recovery action if it does. It would be really nice if you could give
ext3 a shot.
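
To make the timing dependence concrete, here is a toy model in Python.
It is not kernel code: the Request class, the outcome() function and
the time values are all hypothetical, and it assumes that a normal
request interrupted by a PHY event is retried after error handling,
while a failfast request is returned to its issuer as an error.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical stand-in for a block request; not a kernel structure."""
    name: str
    failfast: bool   # True if the issuer asked for no retries on error
    start: float     # time the request was issued
    end: float       # time it would have completed undisturbed


def outcome(req: Request, phy_event_time: float) -> str:
    """Classify what happens to req if a duplicate PHY event fires at
    phy_event_time (toy model, assumptions as stated in the lead-in)."""
    in_flight = req.start <= phy_event_time <= req.end
    if not in_flight:
        return "ok"          # the event missed this request entirely
    if req.failfast:
        return "failed"      # error goes straight back to the issuer
    return "retried"         # reissued once error handling finishes


if __name__ == "__main__":
    ff = Request("failfast write", failfast=True, start=1.0, end=2.0)
    normal = Request("normal write", failfast=False, start=1.0, end=2.0)
    for t in (0.5, 1.5, 2.5):
        print(t, outcome(ff, t), outcome(normal, t))
```

In this model the same workload gives a clean run or an error purely
depending on where the event time falls relative to the failfast
request, which would be consistent with cases A and B differing on the
same kernel.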

Thanks.

--
tejun