Re: amd64 sata_nv (massive) memory corruption

From: Martin K. Petersen
Date: Thu Aug 07 2008 - 12:47:18 EST

Next message: Jesse Barnes: "Re: "e100_probe: Error clearing wake event" when booting 2.6.27-rc1"
Previous message: Stefan Richter: "Re: [ANNOUNCE] mdb-2.6.27-rc2-ia32-08-07-08.patch"
In reply to: Linas Vepstas: "Re: amd64 sata_nv (massive) memory corruption"
Next in thread: Linas Vepstas: "Re: amd64 sata_nv (massive) memory corruption"

>>>>> "Linas" == Linas Vepstas <linasvepstas@xxxxxxxxx> writes:

Linas> My problem is that the corruption I see is "silent": so
Linas> redundancy is useless, as I cannot distinguish good blocks from
Linas> bad. I'm running RAID, one of the two disks returns bad data.
Linas> Without checksums, I can't tell which version of a block is the
Linas> good one.

But btrfs can.
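
To make that concrete, here is a rough sketch of why a per-block checksum
settles the question on a two-disk mirror. It is an illustration only, not
btrfs code: zlib.crc32 stands in for whatever checksum the filesystem
actually records, and pick_good_copy is a made-up name.

```python
import zlib

def pick_good_copy(copies, stored_csum):
    """Return the first mirror copy whose checksum matches the value
    recorded at write time, or None if every copy is bad."""
    for data in copies:
        if zlib.crc32(data) == stored_csum:
            return data
    return None

# Hypothetical 4 KiB block, with the checksum stored when it was written.
good = b"A" * 4096
stored = zlib.crc32(good)

# One disk silently returns the block with a single corrupted byte.
bad = b"A" * 100 + b"B" + b"A" * 3995

# The stored checksum identifies which copy to trust.
chosen = pick_good_copy([bad, good], stored)
```

Without the stored checksum the two copies merely disagree; with it, the
reader knows which one to return and which one to rewrite.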

Linas> There is also an interesting possibility that offers a middle
Linas> ground between raw performance and safety: instead of verifying
Linas> checksums on *every* read access, it could be enough to verify
Linas> only every so often -- say, only one out of every 10 reads, or
Linas> maybe triggered by a cron job in the middle of the night: turn
Linas> on verification, touch a bunch of files for an hour or two,
Linas> turn off verification before 6AM.

All evidence suggests that scrubbing is a good way to keep your data
healthy.

A common corruption scenario a few years ago was bleed to adjacent
tracks due to a frequently written hot spot on disk. Scrubbing in
RAID arrays helped fix that. Modern drives actually maintain an
internal list of hot spots and will automatically schedule refreshes
of adjacent blocks to prevent bleed.

But there are obviously other corruption scenarios that scrubbing can
help alleviate -- including genuine bit rot on the platter.
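
A scrub pass over a checksummed mirror could look roughly like the sketch
below. The in-memory lists standing in for disks, the use of zlib.crc32 and
the function name scrub are all assumptions made for illustration; no real
filesystem or RAID implementation works on Python lists.

```python
import zlib

def scrub(mirrors, csums):
    """Walk every block, verify each mirror's copy against the stored
    checksum, and rewrite bad copies from a good one.
    Returns (repaired_block_numbers, unrecoverable_block_numbers)."""
    repaired, lost = [], []
    for blk, want in enumerate(csums):
        # Find any copy that still matches the recorded checksum.
        good = next((m[blk] for m in mirrors if zlib.crc32(m[blk]) == want), None)
        if good is None:
            lost.append(blk)  # every copy is bad, nothing to repair from
            continue
        for m in mirrors:
            if zlib.crc32(m[blk]) != want:
                m[blk] = good  # overwrite the bad copy
                repaired.append(blk)
    return repaired, lost

# Hypothetical four-block volume mirrored on two "disks".
blocks = [bytes([i]) * 512 for i in range(4)]
csums = [zlib.crc32(b) for b in blocks]
disk_a, disk_b = list(blocks), list(blocks)

# Silent corruption: the first byte of block 2 on disk_b changes.
disk_b[2] = b"\x00" + blocks[2][1:]

repaired, lost = scrub([disk_a, disk_b], csums)
```

Running this periodically (from cron, say) catches latent errors while a
good copy still exists, instead of discovering them on the next read.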

Linas> Yes, well, my HBA is soldered onto my MB, and I'm buying $80
Linas> hard drives one at a time at Fry's Electronics, so it could be
Linas> 5-10 years before DIX/DIF trickles down to consumer-grade
Linas> electronics. And I don't want to wait 5-10 years ...

I doubt it's going to take *that* long.

Corruption of in-flight data has been a problem for years. And it is
a problem that RAID and FS checksums can't fix.

Oracle has been providing customers with in-flight integrity
protection on high-end arrays for many years using a proprietary
technology called HARD. Array vendors license it from us and HARD is
mandatory in a lot of business and government deployments.

DIF/DIX is our attempt to make integrity protection available on mid-
to low-range equipment. We decided to embrace and extend an existing,
open standard and are working with standards bodies to nudge them in
the right direction in terms of new features. It has taken about two
years from conception to product in a highly conservative, slow-moving
industry.

As I mentioned earlier, T13 is working on EPP, which is essentially
DIF for SATA. The protection format is the same which means we can
prepare one type of integrity information regardless of whether the
target drive is SCSI or SATA.
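
For readers who have not seen the format, the following sketch shows what
preparing that integrity information amounts to for one sector. It assumes
the T10 DIF layout of 8 bytes per sector (a 16-bit guard tag holding a CRC
with polynomial 0x8BB7, a 16-bit application tag and a 32-bit reference tag)
and Type 1 style use of the reference tag as the low 32 bits of the LBA. The
function names are invented for the example and the bitwise CRC is written
for clarity rather than speed.

```python
import struct

def crc16_t10dif(data):
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def make_tuple(sector, lba, app_tag=0):
    """Build the 8-byte protection tuple: guard, application tag, reference tag."""
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)

def verify_tuple(sector, lba, pi):
    """Check that the data matches the guard tag and the tuple belongs to this LBA."""
    guard, _app, ref = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(sector) and ref == (lba & 0xFFFFFFFF)

# Hypothetical 512-byte sector destined for LBA 1234.
sector = bytes(range(256)) * 2
pi = make_tuple(sector, 1234)
ok = verify_tuple(sector, 1234, pi)
```

The guard tag catches data corrupted in flight, and the reference tag catches
a sector that arrives intact but at the wrong address.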

Once External Path Protection is ratified I'm expecting drives to
appear fairly quickly. The turnaround time should be short as SATA
drive generations don't last nearly as long as SCSI.

Linas> Thus, a "tactical" solution seems to be pure-software
Linas> check-summing in a kernel device-mapper module, performance be
Linas> damned.

What I don't understand is why you are so focused on fixing this at
the RAID level. I think your time would be better spent contributing
to btrfs which gives you checksums and redundancy on consumer grade
hardware today. It is only a few months away from GA. So why not
implement scrubbing in btrfs instead of spending time on a kludgy
device mapper module with crappy performance?

--
Martin K. Petersen Oracle Linux Engineering
