Black Friday

From: Nick Warne
Date: Fri Oct 15 2004 - 14:22:34 EST

Next message: Christoph Lameter: "page fault scalability patch V10: [0/7] overview"
Previous message: Christoph Lameter: "page fault scalability patch V10: [7/7] s/390 atomic pte operations"
Next in thread: Nick Warne: "Re: Black Friday"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi all,

Here is a story that happened to me today, and a sightly warning for any
sysadmins that read here.

At 9:50 am on the 1st Oct (2 weeks ago), main file server at work crashed big
time (a Snapserver 4100). It trashed the RAID (240GB - 170GB usable [50GB
free]), but the OS (BSD based, I believe) is pretty good and rebuilt it -
took 8 hours - no data lost out of 120+ GB

I also have a second Snapserver that runs Quantums own synchronisation
software, so that at any point it time, server 'b' is an exact copy of server
'a' - this I set to sync at 1:00 am each day. The idea being you can
recover/swap from server to server real time.

The first crash proved problems I never thought off in disaster recover
options. The Snapserver synchronisation software doesn't 'sync' directory
share nor file permissions - just the actual binary data. So the 'copy'
server 'b' is not as is server 'a'.

OK, so since that 'black Friday' two weeks ago, I hacked a way to get the file
permissions from box ''a' to box 'b' replicated manually so at least a quick
swap over from box 'a' to box 'b' would be possible and the change over for
the users would be invisible and all file/share permissions are correct.

Today at 9:50 Snapserver box 'a' crashed again. I suspected that it was dying
now, and the better move would be to push everybody to the back up box 'b'
and replace the dodgy box 'a' until I could replace it.

Except box 'b' was AWOL as well (I didn't scream exactly...). That crashed
too at 9:50 (yes, in sync with box 'a'). The 'sync' software is bloody
good ;)

After a lengthy discussion with Snap engineers, it turns out the OS does a
KERNEL PANIC after a certain number of file opens/shares get accessed on the
version OS I was running. It only does this once the server reaches the
point of whatever threshhold causes it - i.e. a growing, expanding fileserver
- they didn't really tell me, nor elaborate.

But because two weeks ago, only one server crashed, I put it down to the
gremlins... but as 'a' had not sync'ed with 'b' yet, that is why only one
crashed then, and as both are sync'ed since, today both of them.

Two weeks later, both have the same threshhold to cause the kernel crash.

Snap guys gave me new OS firmware to flash - I have done one server, the main
server tomorrow.

The gist of this mail? Never, ever think you are safe.

Nick
P.S still have DLT backup anyway - but users want to USE NOW not wait :(

--
"When you're chewing on life's gristle,
Don't grumble, Give a whistle..."
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Christoph Lameter: "page fault scalability patch V10: [0/7] overview"
Previous message: Christoph Lameter: "page fault scalability patch V10: [7/7] s/390 atomic pte operations"
Next in thread: Nick Warne: "Re: Black Friday"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]