Failover Kernel

From: Tarkan Erimer
Date: Thu Feb 26 2009 - 03:59:26 EST


Hi all,

I'm thinking about a kernel feature called "Failover Kernel". The basic idea is to keep two kernels in memory (the running "Primary Kernel" and a standby "Backup Kernel") so that the system can recover when the kernel panics or crashes.

The feature could work like this:

- The "Backup Kernel" could be specified and loaded into memory via a boot line option like: "failover_kernel=/boot/vmlinuz-2.6.26"
- The running primary kernel sends keepalives to the "Backup Kernel" to signal that it is alive.
- The primary kernel can write a journal (as journaled filesystems do) containing the information the backup kernel needs to recover.
- When the primary kernel crashes and can no longer send keepalives, the backup kernel recovers from this journal, continues from where the primary kernel left off, and becomes the primary.
- When the "Backup Kernel" becomes "Primary", it loads the previous kernel as the "Backup Kernel" again, or this could be left manual: after the disaster recovery, the user could decide which kernel is loaded as backup via a utility like "kexec".
- At kernel compile time, the user can choose the timing for the failover kernel. For example, "Recover after 10 ms of inactivity (not receiving keepalives)".
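
To make the keepalive, journal and timeout steps above a bit more concrete, here is a toy user-space simulation in Python. It is only a sketch of the idea, not kernel code: the names (Journal, BackupKernel, simulate, FAILOVER_TIMEOUT_MS), the 2 ms keepalive interval and the "last_tick" journal entry are all made up for illustration, and time is a simple millisecond counter rather than a real clock.

```python
from dataclasses import dataclass, field

# Hypothetical compile-time setting, taken from the "10 ms" example above.
FAILOVER_TIMEOUT_MS = 10


@dataclass
class Journal:
    """Append-only log of the state the backup would need to recover."""
    entries: list = field(default_factory=list)

    def append(self, entry):
        self.entries.append(entry)

    def replay(self):
        # Rebuild the state by applying entries in the order they were written.
        state = {}
        for key, value in self.entries:
            state[key] = value
        return state


@dataclass
class BackupKernel:
    journal: Journal
    timeout_ms: int = FAILOVER_TIMEOUT_MS
    last_keepalive_ms: int = 0
    is_primary: bool = False
    state: dict = field(default_factory=dict)

    def receive_keepalive(self, now_ms):
        self.last_keepalive_ms = now_ms

    def check(self, now_ms):
        """Take over if no keepalive arrived within the timeout."""
        silent_for = now_ms - self.last_keepalive_ms
        if not self.is_primary and silent_for > self.timeout_ms:
            self.state = self.journal.replay()
            self.is_primary = True
        return self.is_primary


def simulate(crash_at_ms, end_ms, keepalive_every_ms=2):
    """Run the toy model and return (takeover time in ms or None, recovered state)."""
    journal = Journal()
    backup = BackupKernel(journal)
    takeover_ms = None
    for now in range(end_ms + 1):
        primary_alive = now < crash_at_ms
        if primary_alive and now % keepalive_every_ms == 0:
            # The primary journals its progress, then signals that it is alive.
            journal.append(("last_tick", now))
            backup.receive_keepalive(now)
        if backup.check(now) and takeover_ms is None:
            takeover_ms = now
    return takeover_ms, backup.state
```

For example, simulate(20, 50) models a primary that crashes at 20 ms: its last keepalive was at 18 ms, so the backup takes over at 29 ms (the first tick with more than 10 ms of silence) and recovers {"last_tick": 18} from the journal. If the primary never crashes within the simulated window, the backup never takes over.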


The usage scenarios for this feature could be:

- For people whose data center is remote, it is a big problem when you compile a new kernel and reboot into a crashing or non-booting kernel. You are left with a completely crashed, non-functioning system, and a hard reset and manual action are required. With the "Failover Kernel" feature, the system would simply switch back to the "Backup Kernel" (the known stable kernel of the system) and keep working without any manual action.

- Your system has run fine for the last several months, and one day you hit a bug and the kernel crashes or panics. With "Failover Kernel", the system would switch to the "Backup Kernel" quickly (maybe within some milliseconds or a few seconds) to recover and continue working normally.

So, I'm not a coder and I don't know whether this is technically possible. Kernel hackers, what's your opinion? Could it really be possible? If so, how could we implement it?

Many thanks for reading this long (and maybe stupid) post! :-)

Tarkan ERIMER

