Dumps on kernel panic

alex@cconcepts.co.uk
Tue, 4 Jul 1995 19:48:29 +0100 (BST)


> In message <m0sShc8-00013gC@iiit.swan.ac.uk>, iialan@iifeak.swan.ac.uk writes:
> >Marek Writes:
> >> I heard that some other systems actually do a crash dump to swap space.
> >> This might be useful, but of course we should be sure that we are writing
> >> in the right place... Maybe just check for swap-space signature first?
> >> A full memory dump might be very helpful in tracking down some problems.
> >
> >Often the swap data is also needed for the 'complete picture'. Also I have
> >heard enough horror stories of crash dumps from confused machines writing
> >to the wrong partition. At the point the machine crashes something is very
> >wrong. Drew discussed the ideas of a checksummed crash dumper in Germany,
> >that might be safe.
>
> I hashed out a rough design a while ago, but have been more concerned about
> other projects (non Linux). Now that I'm poking at Linux again,
> I'm more interested in avoiding crashes than making them useful :-)
>
> The problem :
>
> We want complete state information to go from RAM to
> some non-volatile storage media.
>
> The constraints we are working under :
>
> 1. If the system panic()'d, everything in RAM may be corrupted. So, to
> do a crash dump, we need to reinitialize all of the parts we are going
> to use.
>
> 2. Reinitializing destroys state information that we may want. It
> also gets nasty in a hurry - what if the memory management code
> screwed up and we can't dynamically allocate more space for the
> new copy of our code?
>
> Neat problem, isn't it?

I think there is another possibility, one which doesn't involve writing to the
disk at all from a damaged system. Forgive me if this has been suggested
before, and forgive me twice if it has been suggested and decried.

On system boot, allocate a fixed number of physical pages for preserving
'important' data. Let's say 2Mb (yes, I propose this only as a configuration
option!).

Make panic() copy the bottom megabyte, plus the call stack, plus the
registers and everything else it currently dumps, into this area, and
then set a magic number somewhere in the bottom megabyte. Also checksum
each meg of memory, and stick the checksums somewhere.
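
As a rough illustration of the step above, here is a userspace C sketch, not
real kernel code. The struct layout, CRASH_MAGIC, MAX_MEGS and the function
names are all made up for the example, the checksum is an arbitrary
rotate-and-xor, and for simplicity the magic number is kept in the save
area's own header rather than in the bottom megabyte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MEG         (1024u * 1024u)
#define CRASH_MAGIC 0x43524153u  /* hypothetical marker value */
#define MAX_MEGS    64           /* hypothetical upper bound on RAM size */

/* Hypothetical layout of the reserved save area. */
struct crash_area {
    uint32_t magic;              /* written last, once the rest is in place */
    uint32_t nmegs;              /* number of megabytes checksummed */
    uint32_t sums[MAX_MEGS];     /* one checksum per megabyte of RAM */
    uint8_t  low_meg[MEG];       /* copy of the bottom megabyte */
};

/* Rotate-and-xor checksum over one region; a real dumper might use a CRC. */
uint32_t region_sum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    while (len--)
        sum = (sum << 1 | sum >> 31) ^ *p++;
    return sum;
}

/* What panic() might do. ram stands in for physical address 0,
 * ram_len is the amount of RAM in bytes. */
void crash_save(struct crash_area *area, const uint8_t *ram, size_t ram_len)
{
    size_t nmegs = ram_len / MEG;

    assert(nmegs >= 1 && nmegs <= MAX_MEGS);
    memcpy(area->low_meg, ram, MEG);
    for (size_t i = 0; i < nmegs; i++)
        area->sums[i] = region_sum(ram + i * MEG, MEG);
    area->nmegs = (uint32_t)nmegs;
    area->magic = CRASH_MAGIC;   /* marks the save area as valid */
}
```

Writing the magic number last means a panic() that dies halfway through
leaves no marker behind, so the next boot would not mistake a partial save
for a good one.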

Here's the not-very-cunning bit: most PCs, when you press the reset button,
do *not* clear memory. If they do, it's often the memory check and BIOS
initialisation that do it, which often only affect the bottom
megabyte, and even then not always that. If the PC memory-checks above
1Mb (which is normally determined by some BIOS word being 4321 or 1234
or similar - all documented) then it will, admittedly, corrupt this
area, but we can prevent this by setting the word appropriately, which
will fix us up in the majority of cases.
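
Setting that word might look something like the sketch below. The address
(0x472, i.e. 0040:0072 in the BIOS data area) and the value 0x1234 are my
assumption of the usual PC convention for "warm boot, skip the memory test"
and should be checked against the BIOS documentation; the function name is
made up, and low_mem is just an ordinary buffer standing in for a mapping of
physical address 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIOS_RESET_FLAG_ADDR 0x472u   /* assumption: reset flag word */
#define BIOS_WARM_BOOT       0x1234u  /* assumption: skip memory test */

/* Ask the BIOS for a warm boot on the next reset.
 * low_mem stands in for physical address 0, len is its size in bytes. */
void request_warm_boot(uint8_t *low_mem, size_t len)
{
    assert(len >= BIOS_RESET_FLAG_ADDR + 2);
    /* Store the word low byte first, as the x86 would. */
    low_mem[BIOS_RESET_FLAG_ADDR]     = BIOS_WARM_BOOT & 0xff;
    low_mem[BIOS_RESET_FLAG_ADDR + 1] = BIOS_WARM_BOOT >> 8;
}
```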

Now all we have to do is prevent LILO from destroying our valuable data
in its path. 'All' we have to do is ensure that the areas LILO attacks
were backed up to our safety area. Very early in the kernel boot
sequence (before the MM is inited), the kernel sees (from the magic number)
that it is on a crash recovery boot, marks the appropriate areas in the
memory map as areas it can't use, and then comes up as an extremely
low memory system, which won't turn swapping on (so we don't corrupt swap
files).
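
The early-boot check could be sketched roughly as follows. Again this is
userspace C under assumptions of mine: CRASH_MAGIC, the one-byte-per-page
map and the function name are all hypothetical, and I assume the region
that was backed up (and may therefore be reused) is simply the first
backed_up_pages pages of memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRASH_MAGIC 0x43524153u   /* hypothetical marker value */

enum page_state { PAGE_FREE = 0, PAGE_RESERVED = 1 };

/* Build the memory map before the MM is initialised.
 * On a crash recovery boot only the low pages that were backed up may be
 * used; everything above them still holds the crashed system's state.
 * Returns the number of pages the kernel is allowed to use. */
size_t crash_boot_setup(uint32_t magic, uint8_t *page_map,
                        size_t npages, size_t backed_up_pages)
{
    size_t usable = 0;

    assert(backed_up_pages <= npages);
    for (size_t i = 0; i < npages; i++) {
        int keep_out = (magic == CRASH_MAGIC) && (i >= backed_up_pages);

        page_map[i] = keep_out ? PAGE_RESERVED : PAGE_FREE;
        if (!keep_out)
            usable++;
    }
    return usable;
}
```

With 4Mb of RAM in 4k pages and the bottom 1Mb backed up, a recovery boot
would come up with 256 usable pages out of 1024, while a normal boot
(no magic number) gets all of them.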

The first thing this does is write the whole of the 'unusable' (i.e. valuable)
part of memory out to disk, and back up the swap partition. This gives you
a complete image of your entire system, except for the 2Mb or so you
reserved for backup purposes, which hopefully won't contain anything useful.

Note you aren't getting memory for free - you are losing 2Mb of RAM. This
option thus isn't going to suit everyone, but it might be a useful debugging
tool.

Alex

----------------------------+-------------+-----------------------------
Alex Bligh : ,-----. :
Computer Concepts Ltd. : : : alex@cconcepts.co.uk
Gaddesden Place : : ,-----. :
Hemel Hempstead : `-+---` ` : Tel. +44 1442-351000
Herts. UK HP2 6EX : | , : Fax. +44 1442-351010
: `-----` :
----------------------------+-------------+-----------------------------