Re: Core dumps & restarting

Alan Cox (alan@lxorguk.ukuu.org.uk)
Tue, 29 Oct 1996 22:10:04 +0000 (GMT)

Next message: Alan Cox: "Re: Linux box finds it hard to wake up in the morning"
Previous message: Andy Wang: "kernel 2.1.6 (or hell, 2.1.x boot problems)"
In reply to: Anthony Pardini: "Aiee"

> connections. Some code dies because time suddenly warps. Or where
> to position file pointers when restoring a process? This is usually
> trivial when the file hasn't changed since taking the snapshot but
> can get very hairy otherwise.
>
> This depends upon how short you can get "later" and how much state you
> can fully save. Behold...
>
> We can already do things (generic unix'y speaking) like dump a
> complete core image of ram onto a disk when we punt, and we have the
> technology for multiple initiator SCSI configurations and to make
> that work.

Messy. If you are doing this kind of shit for real (and I mean real as in
the 'we have 300 seconds to get back on our feet or the furnace is
6000ft into the upper atmosphere' type real) then you are taking snapshots
of both the program and data onto remote machines. (That's quite easy to do
with a log-based fs.) You are also making synchronization points, and at each
of these you mark the file log as 'believed coherent' so that a crash
won't leave junk on the files to fool a restart.
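
To make the 'believed coherent' idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular log-based fs does it: the record layout, the COHERENT marker name, and the helpers append_record, mark_coherent and recover are all hypothetical. The point is only that a restart trusts nothing written after the last synchronization point.

```python
import json
import os
import tempfile

COHERENT = "COHERENT"  # hypothetical marker name, not taken from any real fs


def append_record(path, record):
    """Append one JSON record to the log and force it to stable storage."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())


def mark_coherent(path):
    """Synchronization point: everything logged so far is believed coherent."""
    append_record(path, {"type": COHERENT})


def recover(path):
    """Return only the records that precede the last coherent marker."""
    committed, pending = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                break  # torn record from the crash; nothing after it is usable
            if rec.get("type") == COHERENT:
                committed.extend(pending)
                pending = []
            else:
                pending.append(rec)
    return committed  # anything still in `pending` is junk and is dropped


def demo():
    with tempfile.TemporaryDirectory() as d:
        log = os.path.join(d, "model.log")
        append_record(log, {"type": "write", "offset": 0, "value": 1.5})
        append_record(log, {"type": "write", "offset": 8, "value": 2.5})
        mark_coherent(log)
        # A stray write just before the crash, never marked coherent.
        append_record(log, {"type": "write", "offset": 999, "value": -1.0})
        return recover(log)
```

Calling demo() replays the two writes made before the marker and discards the stray one, so the junk never reaches the restarted program.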

A logging file system is all part of a dump/restore recovery mechanism. A
blind recovery can be very bad - even for non-critical systems. A crash
and restart of a large engineering model could slip a small but critical
error into the calculations by seeking to the wrong place, writing a value,
and then crashing because of the earlier error that caused the bad write. You
might do a few weeks of further modelling before you see that error.
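
As a rough sketch of the opposite of a blind recovery, a restart can refuse to reposition a file pointer unless the file still matches what the snapshot saw. This is Python again and only an assumption about how one might check it: the SHA-256 fingerprint and the names snapshot_position and restore_position are hypothetical, and hashing a whole file is far too slow for big data sets.

```python
import hashlib
import os
import tempfile


def fingerprint(path):
    """SHA-256 of the file contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot_position(path, offset):
    """Record where the process was in the file and what the file looked like."""
    return {"path": path, "offset": offset, "sha256": fingerprint(path)}


def restore_position(snap):
    """Reopen and seek only if the file is unchanged since the snapshot."""
    if fingerprint(snap["path"]) != snap["sha256"]:
        raise RuntimeError("file changed since snapshot; refusing blind restore")
    f = open(snap["path"], "r+b")
    f.seek(snap["offset"])
    return f


def demo_restore():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data.bin")
        with open(path, "wb") as f:
            f.write(b"0123456789")
        snap = snapshot_position(path, 4)
        with restore_position(snap) as f:  # unchanged file: seek is safe
            unchanged = f.read(2)
        with open(path, "ab") as f:  # file modified after the snapshot
            f.write(b"X")
        try:
            restore_position(snap).close()
            refused = False
        except RuntimeError:
            refused = True
        return unchanged, refused
```

With the file untouched the restore lands on the saved offset; once the file has been modified the restore is refused rather than seeking to what may now be the wrong place.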

Alan
