Re: dump device

From: almesber@lrc.di.epfl.ch
Date: Tue Jul 04 2000 - 02:23:39 EST


Tigran Aivazian wrote:
> ftp://icaftp.epfl.ch/pub/people/almesber/misc/ols2k-9.ps.gz

One little warning: this isn't the final URL. I'll decide on that in the
next few days. (Depends mainly on whether I can host booting.org at a
more suitable place than my PC at home.)

> In this paper he describes a dual-kernel mechanism where a small kernel
> for raking the crashdump is preloaded along with a initrd and when we
> panic the preloaded stuff is verified and that kernel is launched to do
> the actual coredump.
>
> http://www.missioncriticallinux.com/technology/coredump/

This is roughly the concept that we've discussed. I didn't look at their
current implementation, so there may be some differences. (My current
focus is on making bootimg stable on as many i386 configurations as
possible. Yesterday, I could finally lay hands on that SMP box, so I can
now do useful work with bootimg again.)

In any case, I think there's actually not much choice for a dumping
technology on i386 than this approach, considering that:

 - you better don't trust the BIOS
 - you want some flexibility (e.g. dump to any file/partition; maybe
   also ring the pager, etc.)
 - you don't want to re-write lots of drivers
 - you don't want to change the state of the crashed kernel before
   the dump is written

Unfortunately, this still doesn't solve:

 - hangs (without prior oops *) requiring a hardware reset
 - massive performance degradation without oops
 - spontaneous reboots (e.g. triple fault)
 - dump device stuck in an unrecoverable state

The last two cases should be rare. The first two cases could perhaps
be solved by using the NMI to invoke the dumper. (Are there any PCI
cards on the market with an NMI button ?) In some cases, a "hot key"
that can't be caught/blocked by X and such could be used as a trigger
too.

(* in my experience, the kernel hardly ever panics. Most of the time,
   I get one or more oops (sometimes an infinite stream of them),
   before the system dies completely. A dump should therefore be
   taken on the first oops, even if it's a "harmless" one. The more
   sensitive the application, the less one should be willing to trust
   your system after any oops anyway.)

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, ICA, EPFL, CH       werner.almesberger@ica.epfl.ch /
/_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Fri Jul 07 2000 - 21:00:14 EST