RE: LOTS OF BAD STUFF in raid0: raid0145-19990824-2.2.11 is unstable

Roeland M.J. Meyer (rmeyer@mhsc.com)
Fri, 5 Nov 1999 12:34:43 -0800

Next message: nathan.zook@amd.com: "RE: I think I found cause of severe data corruption on 2.3.24-ikd"
Previous message: Pavel Machek: "Re: I think I found cause of severe data corruption on 2.3.24-ikd -- memory detection"

FWIW, I've had a RAID0 partition, doing UseNet, for the past year. No problems
found to date. It's a pair of 2GB SCSI-W drives on an Adaptec AHA290x controller.
The kernel is v2.0.37. I have had problems in the past where the disk was
flaky (but not completely dead) and the RAID stuff didn't help me much in
determining that. In those cases, I had to remount the disk, outside of the
RAID environment, and run single-disk diagnostics. In one extreme case, only
black-box replacement of the drive yielded the fact that it was the drive
(process of elimination). I RMA'd the drive and everything has been working
fine since. I could not have diagnosed the problem had I not had spare
machines and extra hard drives to compare against.
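
To illustrate what a crude single-disk check outside the RAID layer can look
like, here is a minimal sketch in Python. It is not the diagnostic used above.
The function name scan_blocks and the default block size are assumptions made
for this example. It reads a device (or any file) front to back and records
the offsets where a read fails or comes up short. Reads served from the page
cache can mask a bad sector, so treat a clean pass as weak evidence only.

```python
import os


def scan_blocks(path, block_size=64 * 1024):
    """Read `path` sequentially; return (blocks_ok, bad_offsets).

    Hypothetical helper for illustration: `path` may be a raw disk
    device (opened read-only) or an ordinary file.
    """
    bad_offsets = []
    blocks_ok = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for files and, on Linux,
        # for block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
                if len(data) == want:
                    blocks_ok += 1
                else:
                    # A short read before the end of the device is suspect.
                    bad_offsets.append(offset)
            except OSError:
                # The kernel reported an I/O error for this block.
                bad_offsets.append(offset)
            offset += want
    finally:
        os.close(fd)
    return blocks_ok, bad_offsets
```

Run it against each member disk on its own, with the array stopped, and
compare the results; a drive that only fails intermittently may still pass.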

There are many cases where swap-n-pray is the ONLY diagnostic mechanism that
works (unless you have extensive hardware testing facilities, of course).
The lesson is, the problem's not always software.

> -----Original Message-----
> From: owner-linux-raid@vger.rutgers.edu
> [mailto:owner-linux-raid@vger.rutgers.edu]On Behalf Of David Mansfield
> Sent: Friday, November 05, 1999 11:29 AM
> To: Ingo Molnar
> Cc: Michael K. Johnson; linux-raid@vger.rutgers.edu;
> linux-kernel@vger.rutgers.edu; Alan Cox
> Subject: Re: LOTS OF BAD STUFF in raid0: raid0145-19990824-2.2.11 is unstable
>
>
>
> > it's 99.99% a problem with the disk. The RAID0 code has not had any
> > significant changes (due to its simplicity) in the last couple of years.
> > We never rule out software bugs, but this is one of those cases where it's
> > way, way down in the list of potential problem sources.
> >
> > -- mingo
> >
>
> So all of the 'attempt to access beyond end of device' errors reported
> over the last 6 months, both RAID and standard kernel, are all chalked up
> to hardware failure? I mean, that's what this is. It just hits the RAID0
> request layer BEFORE the standard check that generates the 'attempt...'
> error. I believe there is a bug lurking, because people keep reporting
> the 'attempt...' error, and this is that exact problem, it just hits a
> different place.
>
> My $.02 (and very little kernel hacking exp :-( ) make me believe there is
> still one bug in the buffer-cache code (generic kernel) that causes this,
> but thanks for the feedback. I'll shut-up now.
>
> David
>
> --
> /==============================\
> | David Mansfield |
> | david@cobite.com |
> \==============================/
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
