Re: Oracle RBS corruption (was: bdflush problem ?)

François Désarménien (desar@club-internet.fr)
Wed, 01 Dec 1999 23:04:55 +0100


First, thank you for your reply! I was feeling a little bit lonely...

Richard Zidlicky wrote:
>
> > And I'm just experiencing a rollback segment corruption again, and it
> > seems there hasn't been any crash. So I suspect there is something else :-(
>
> :-(
> That would mean that the database program is broken. In fact, thinking a bit
> more about it I highly doubt that the missing update could have such an effect
> on a Database - after all they are expected to do explicit 'sync' occasionally
> so that they should be safe without it.
>

It seems they don't 'sync': when I ran one manually, before starting 'update',
it took so long that I think everything was still in memory. And when a crash
occurred, my customer lost a whole day of everything (database, files shared
with Samba, inodes...)
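
For reference, here is a minimal sketch of what an explicit flush looks like
from an application's side. It is only an illustration: the function name
write_durably is hypothetical, and nothing here claims to show how Oracle
itself writes its files. It relies only on the standard os.write and os.fsync
calls.

```python
import os

def write_durably(path, data):
    """Write bytes to path, then ask the kernel to flush them to the device.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop until done.
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Without this call the data may sit in the buffer cache until the
        # kernel (or a periodic flusher such as 'update') writes it out.
        os.fsync(fd)
    finally:
        os.close(fd)
    return len(data)
```

A program that never does the fsync step depends entirely on the periodic
flush, which would match the very long manual 'sync' described above.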

> Did you check the state of allocated buffers (shift-scrollock) ?
> Run memory testers? It seems very modern to get corrupted memory nowadays,
> there was a thread about it just recently on l-k.
>
Can I check it another way? I use a modem to connect to the server, so I
suspect <shift-scrollock> won't work over the phone line :-)
Also, what should I check for?

Do you have pointers to memory tester programs? However, I don't really believe
in this option, as I got corruption on a few different servers.
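
To make clear what such a tester does, here is a rough sketch of the basic
pattern test: fill a buffer with known bit patterns and read them back. The
name pattern_test and the choice of patterns are assumptions for illustration.
A user-space check like this only touches the pages the kernel happens to give
the process, and CPU caches can hide faults, so it is no substitute for a
dedicated tester run at boot time.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each pattern and read it back.

    Returns a list of (pattern, offset) pairs where the value read back
    differs from the value written. Hypothetical helper for illustration.
    """
    buf = bytearray(size_bytes)
    bad = []
    for p in patterns:
        expected = bytes([p]) * size_bytes
        buf[:] = expected            # write the pattern over the whole buffer
        if buf != expected:          # fast path: compare the whole buffer
            bad.extend((p, i) for i in range(size_bytes) if buf[i] != p)
    return bad
```

An empty result from pattern_test(64 * 1024 * 1024) says little on its own;
a non-empty one would be a strong hint of a hardware problem.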

> Btw, does the DB prog use raw device access? If so is the whole device used
> only for the DB?
>
No, I don't use raw devices, just plain files.

> > I noticed a similarity between the two sites on which I had corruption: they
> > are both running HP servers with an HP NetRaid (amird) controller. One runs a 2.0.36
> > kernel, and the other a 2.2.9 kernel (RedHat 5.2 and Mandrake 6.0).
>
> hard to find a common point between those two kernels.
>
True for the kernels, but both servers use the amiraid driver.

> > Maybe it could be something wrong with the amird driver or HP NetRaid firmware?
>
> I have no experience with raid, but I suspect if there was something wrong
> with the device it should show up in the logs. You should not forget to double
> check your database setup, try a smaller toy database and see if the problem
> is reproducible on other devices.
>
Nothing in the logs. But if something goes wrong within the driver/firmware, I
think anything can happen.

As for running tests, we cannot afford the kind of beast we sell to our
customers...

Worse than that, I've been through the Oracle log file, and found it was
scrambled with pieces of a kernel include file. So there *is* memory/disk
corruption, and Oracle is not the only culprit :-(
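
This kind of contamination can be looked for mechanically. The sketch below
scans a file's bytes for lines that look like C preprocessor directives, which
have no business in a database log. The function name and the list of
directives are assumptions for illustration, and a hit only shows that foreign
text is present, not where it came from.

```python
import re

# Lines starting with a C preprocessor directive, e.g. "#define NR_OPEN 256".
HEADER_HINTS = re.compile(
    rb"^[ \t]*#[ \t]*(include|define|ifdef|ifndef|endif)\b", re.MULTILINE
)

def find_header_fragments(data):
    """Return (offset, line) pairs for lines that look like C header text.

    Hypothetical helper for illustration; 'data' is the raw file content.
    """
    hits = []
    for m in HEADER_HINTS.finditer(data):
        end = data.find(b"\n", m.start())
        if end == -1:
            end = len(data)
        hits.append((m.start(), data[m.start():end]))
    return hits
```

It would be called on the raw bytes of the log, for example
find_header_fragments(open(path, "rb").read()), where path is whichever log
file is suspected.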

My sysadmin life is becoming a nightmare, and my colleagues are starting to
say that Linux is not so stable. Too bad :-<

François Désarménien
