Re: Core dumps & restarting

Kai Henningsen (kai@khms.westfalen.de)
05 Nov 1996 09:49:00 +0200

Next message: Michael Elizabeth Chastain: "Re: Small speedup for 'system_call'"
Previous message: Ulrich Windl: "time warps: good bad news"

davem@caip.rutgers.edu (David S. Miller) wrote on 29.10.96 in <199610290516.AAA13856@caip.rutgers.edu>:

> Why not dump the core ram image to another "machine", drop
> reservations on all the SCSI devices you are talking to, and then tell
> the machine "mount my disks, assume my ip addresses, and act like me,
> because I'm going down". It can work with something like a 3 minute
> max takeover time if you do it right. If the panic'ing machine can
> come back up cleanly, the transfer of core image can happen again,
> scsi device ownership given back, ip interfaces set back up, and you
> are _still_ back in operation. You can get it so good that it only
> looks like the network is saturated to your users ;-)

Sounds like what you *really* want is the thing Novell calls SFTIII. I've
never seen it myself, but I gather the idea is to have (usually) two
machines with identical hardware (and preferably a *very fast* dedicated
network connecting them), acting to the outside world like a single
machine. If one of them dies, your clients never notice.

It seems to work something like this: you have a lower level OS part on
both of them that deals with the hardware, and you have a higher level OS
that keeps itself synchronized via the internal net connection and acts
like a single OS to the net and any user processes. Keeping the disks in
synch reduces to simple mirroring in this scenario, btw. (Of course, it
probably needs to be able to change ethernet hardware addresses to make
the interface on one server look *exactly* like the interface on the other
to the clients.)
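
A toy sketch of that split, purely as illustration: this is not how
SFTIII is actually implemented, and the names Node, MirroredPair and
shared_mac are made up for the example. Each Node stands in for one
machine's lower level part with its own disk; MirroredPair stands in
for the common upper part that mirrors every write and moves the
client-visible address to the survivor when the active machine dies.

```python
class Node:
    """One physical machine with its own copy of the mirrored disk."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.disk = {}  # stands in for the mirrored disk contents

    def apply(self, key, value):
        self.disk[key] = value


class MirroredPair:
    """Hypothetical 'upper level' that makes two nodes look like one server."""

    def __init__(self, primary, secondary, shared_mac):
        self.primary = primary
        self.secondary = secondary
        self.shared_mac = shared_mac  # the one address clients ever see
        self.mac_owner = primary      # node currently answering on it

    def fail_over(self):
        """Promote the partner and hand it the client-visible address."""
        if not self.secondary.alive:
            raise RuntimeError("both nodes are down")
        self.primary, self.secondary = self.secondary, self.primary
        self.mac_owner = self.primary

    def write(self, key, value):
        if not self.primary.alive:
            self.fail_over()
        # Apply on the active node, then mirror to the partner if it is up.
        self.primary.apply(key, value)
        if self.secondary.alive:
            self.secondary.apply(key, value)
        return "ok"

    def read(self, key):
        if not self.primary.alive:
            self.fail_over()
        return self.primary.disk[key]
```

A client that only ever talks to the MirroredPair keeps getting answers
after the first node dies, because every write it made was already
applied on the partner's disk. A real system would also have to resync
the failed machine before letting it rejoin, which the sketch leaves out.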

(SFTIII seems traditionally to have all sorts of convolutions connected to
where you load stuff: in the lower level, in the individual parts, or in
the common part. But that may be because NetWare doesn't have a
kernel/user separation.)

If you can do something like that, you can probably also do a distributed
OS with very similar code.

Regards, Kai
