Re: 2.6.24.3: regular sata drive resets - worrisome?

From: Tejun Heo
Date: Sat Mar 29 2008 - 20:54:46 EST


Hello,

Hans-Peter Jansen wrote:
>>> Should I be worried? smartd doesn't show anything suspicious on those.
>> Can you please post the result of "smartctl -a /dev/sdX"?
>
> Here's the last SMART report from two of the offending drives. As noted
> before, I reorganized the hardware, replaced the dog-slow 3ware 9500S-8 and
> the SiI 3124 with a single Areca 1130, and retired the drives for now, but a
> nephew has already shown interest. What do you think, can I cede those
> drives with a clear conscience? The Hardware_ECC_Recovered values are
> really worrisome, aren't they?

Different vendors use different scales for the raw values. The value is still pegged at the highest, so it could be that those raw values are okay or that the vendor just doesn't update the value field accordingly. My P120 says 0 for the raw value and 904635 for hardware ECC recovered, so there is some difference. What do other non-failing drives say about those values?
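
To illustrate reading the normalized columns rather than the raw one, here is a minimal Python sketch. The helper names (parse_attr, below_threshold) are made up for illustration and are not part of smartctl. It assumes only the ten-column attribute layout of the report in this thread and the usual rule that an attribute counts as failed once VALUE drops to or below a non-zero THRESH.

```python
def parse_attr(line):
    """Split one row of the smartctl attribute table into named fields.

    Assumes the ten-column layout used in this thread:
    ID#, name, flag, value, worst, thresh, type, updated, when_failed, raw.
    """
    f = line.split(None, 9)
    return {
        "id": int(f[0]),
        "name": f[1],
        "value": int(f[3]),   # normalized value, vendor-scaled
        "worst": int(f[4]),
        "thresh": int(f[5]),
        "type": f[6],
        "raw": f[9].strip(),  # vendor-specific, not comparable across vendors
    }


def below_threshold(attr):
    """True when the normalized value is at or below a non-zero threshold."""
    return attr["thresh"] > 0 and attr["value"] <= attr["thresh"]


# Row copied from the report in this thread.
row = "195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 162956700"
attr = parse_attr(row)
print(attr["name"], attr["value"], attr["thresh"], below_threshold(attr))
```

The huge raw number does not enter the check at all, which is the point: only the normalized value and threshold are meant to be comparable.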

> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always   -           82
>   3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always   -           5952
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           23
>   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always   -           0
>   7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always   -           0
>   8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline  -           0
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           17647
>  10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always   -           0
>  11 Calibration_Retry_Count 0x0012   253   002   000    Old_age   Always   -           0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           19
> 190 Airflow_Temperature_Cel 0x0022   124   124   000    Old_age   Always   -           38
> 194 Temperature_Celsius     0x0022   124   124   000    Old_age   Always   -           38
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -           162956700
> 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always   -           0
> 197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always   -           0
> 198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline  -           0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
> 200 Multi_Zone_Error_Rate   0x000a   253   100   000    Old_age   Always   -           0
> 201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always   -           0
> 202 TA_Increase_Count       0x0032   253   253   000    Old_age   Always   -           0

Hmmm... If the drive is failing FLUSHs, I would expect to see elevated reallocation counters and maybe some pending counts. Aieee.. weird.
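
A rough way to check for that automatically is sketched below in Python. The function name and the choice of which attributes count as media trouble indicators are assumptions made for illustration. The attribute names and sample rows are taken from the report in this thread.

```python
# Attribute names as they appear in the report in this thread. Treating these
# four as the "media trouble" indicators is a selection made for illustration.
MEDIA_COUNTERS = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def nonzero_media_counters(report):
    """Return {name: raw} for media counters whose raw value is above zero."""
    found = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have ten columns and start with a numeric ID;
        # this skips the header and any surrounding text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in MEDIA_COUNTERS and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


# Rows copied from the report in this thread.
sample = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 253 253 010 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 253 253 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
"""
print(nonzero_media_counters(sample) or "no reallocated or pending sectors reported")
```

For this drive the result is empty, which is why the FLUSH timeouts look odd: none of the counters that usually accompany failed media writes have moved.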

>>> It's been 4 Samsung drives in all, hanging on a SATA SiI 3124:
>> FLUSH_EXT timing out usually indicates that the drive is having a problem
>> writing out what it has in its cache to the media. There was one case
>> where a FLUSH_EXT timeout was caused by the driver failing to switch the
>> controller back from NCQ mode before issuing FLUSH_EXT, but that was on
>> sata_nv. There hasn't been any similar problem on sata_sil24.
>
> Hmm, I didn't notice any data distortions, and if there were, they live
> on as copies in their new home..

It should have appeared as read errors. Maybe the drive successfully wrote those sectors after the 30+ sec timeout.

--
tejun