Re: KMSGDUMP: dump kernel messages to a diskette

Dan Hollis (goemon@sasami.anime.net)
Thu, 8 Jul 1999 14:37:01 -0700 (PDT)


On Thu, 8 Jul 1999, Willy Tarreau wrote:
> > > Well, in that case you could use 0xa, which is handled by all i386 and
> > > newer BIOSes, AFAIK, and requires less messing. Neither are safe, anyway,
> > > because they require RAM contents to be valid which might not be the case.
> > > And executing from clobbered RAM mau cause further destruction -- e.g.
> > > some unwanted hard disk writes.
> > md5 the important parts of memory, and then checksum verification before
> > using them. quite simple, and 100% guarantees we don't use clobbered ram.
> what if the md5-verifying function is clobbered itself ?

Then the md5 will fail and you will kindly advise the user that a dump is
not possible (or warn them it is extremely unwise).

Of course, we could spend the rest of the year making up excuses about
situations where dumping can't work. Do you really want to do that, or do
you want to examine the cases where dumping can work?

> - upon crash, the messages are immediately dumped to sector 2 of the
> first hard disk known to the bios. There should be at least one for lilo.
> This space is nearly never used (except by viruses) and can handle several
> sectors of data.

If the hardware is completely locked up (eg some card has the bus locked),
this will not be possible. That's why any disk writing should not be done
immediately on an OOPS while still in a kernel -- because you cannot be
sure at that point what state the hardware is in.

At least after a hardware reset (red button reset), you would be
reasonably sure (more sure than you would be when dumping at OOPS time)
that the hardware is in a state where it can be initialized and a proper
dump can be done.

> - the system reboots through the bios.
> - lilo sees anormal data in these sectors and copies them itself to the
> right place, while marking them read.
> - it then reboots completely (hard reboot) to ensure better cleaning.
> - when it gets back, it sees the sectors marked as "read" so it boots
> normally.

Here is my scenario:

- PC locks up, HARD, due to interaction between PCI cards and buggy
drivers. no oops is possible since the harware is locked up.
- Press PC "red reset button" for hardware reset.
- PC enters minimal dump code using '286 return to real mode' trick.
- dump code would verify md5 of itself. if md5 is not valid, it asks
user if they really want to try continue dumping but warns them that
it might corrupt data. on the other hand if md5 is valid, it
initializes hardware minimally if needed, and loads lilo.
- lilo dumps memory to swap paritition, then completely reboots the PC.
- lilo loads linux.
- 'swapon' sees a memory dump in the swap, and writes that to
/var/crash/crashdump-bla-bla (of course, this dir and all files are
mode 600 and owned by root).

-Dan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/