kernel panic in SoftwareRAID autodetection

From: Roberto Ragusa (robertoragusa@technologist.com)
Date: Fri Dec 01 2000 - 09:50:49 EST


Please CC to me because I'm not a LKML subscriber.

Hi,

I found a real showstopper problem in the SoftwareRAID autodetect
code; 2.4.0-test10 and 2.4.0-test11 are affected (I didn't test
previous versions).

I'm using two IDE disk with some RAIDed partitions:

  md5 : active raid0 hdc5[1] hda5[0] 2056064 blocks 32k chunks
  md6 : active raid0 hdc6[1] hda6[0] 1027968 blocks 32k chunks
  md7 : active raid0 hdc7[1] hda7[0] 634752 blocks 32k chunks
  md8 : active raid0 hdc8[1] hda8[0] 1027968 blocks 32k chunks
  md9 : active raid0 hdc9[1] hda9[0] 2237568 blocks 32k chunks
  md10 : active raid0 hdc10[1] hda10[0] 60288 blocks 32k chunks
  md11 : active linear hdc13[3] hdc11[1] hda13[2] hda11[0] 422912 blocks 32k rounding

that is 6 of them are RAID0 and 1 is RAIDlinear.

I compiled the 2.4.0-test11 kernel on a RH7.0 installation (using kgcc),
with the following options:

  CONFIG_MD=y
  CONFIG_BLK_DEV_MD=y
  CONFIG_MD_LINEAR=y
  CONFIG_MD_RAID0=y
  CONFIG_MD_RAID1=y
  CONFIG_MD_RAID5=m
  # CONFIG_MD_BOOT is not set
  CONFIG_AUTODETECT_RAID=y
  # CONFIG_BLK_DEV_LVM is not set
  # CONFIG_LVM_PROC_FS is not set

but when I try to boot the kernel panics during the RAID autodetect phase
(without autodetection it seems OK). Tried many times, not a random problem.
So I used "console=ttyS0" to grab the output and I found it is booting
without errors (!). Tried many times.
I had to hand copy the text from the screen to this mail.

This is what I see when everything is OK (serial console grab):

  Linux version 2.4.0-test11 (root@linux86.end) (gcc version egcs-2.91.66 19990314
  /Linux (egcs-1.1.2 release)) #1 Fri Dec 1 12:27:42 CET 2000
[...]
  Partition check:
   hda: hda1 hda2 hda3 hda4 < hda5 hda6 hda7 hda8 hda9 hda10 hda11 hda12 hda13 hda
  14 >
   hdc: hdc1 hdc2 hdc3 hdc4 < hdc5 hdc6 hdc7 hdc8 hdc9 hdc10 hdc11 hdc12 hdc13 hdc
  14 >
   hdd: hdd1 hdd2 hdd3 hdd4
[...]
  md driver 0.90.0 MAX_MD_DEVS=256, MAX_REAL=12
  linear personality registered
  raid0 personality registered
  raid1 personality registered
  md.c: sizeof(mdp_super_t) = 4096
  autodetecting RAID arrays
  (read) hda5's sb offset: 1028032 [events: 00000ace]
  (read) hda6's sb offset: 513984 [events: 00000aca]
  (read) hda7's sb offset: 317376 [events: 00000aca]
  (read) hda8's sb offset: 513984 [events: 00000aca]
  (read) hda9's sb offset: 1118784 [events: 00000096]
  (read) hda10's sb offset: 30144 [events: 00000096]
  (read) hda11's sb offset: 105728 [events: 000000a4]
  (read) hda13's sb offset: 105728 [events: 000000a4]
  (read) hdc5's sb offset: 1028032 [events: 00000ace]
  (read) hdc6's sb offset: 513984 [events: 00000aca]
  (read) hdc7's sb offset: 317376 [events: 00000aca]
  (read) hdc8's sb offset: 513984 [events: 00000aca]
  (read) hdc9's sb offset: 1118784 [events: 00000096]
  (read) hdc10's sb offset: 30144 [events: 00000096]
  (read) hdc11's sb offset: 105728 [events: 000000a4]
  (read) hdc13's sb offset: 105728 [events: 000000a4]
  autorun ...
  considering hdc13 ...
    adding hdc13 ...
    adding hdc11 ...
    adding hda13 ...
    adding hda11 ...
  created md11
  bind<hda11,1>
  md11: WARNING: hda13 appears to be on the same physical disk as hda11. True
       protection against single-disk failure might be compromised.
  bind<hda13,2>
  bind<hdc11,3>
  md11: WARNING: hdc13 appears to be on the same physical disk as hdc11. True
       protection against single-disk failure might be compromised.
  bind<hdc13,4>
  running: <hdc13><hdc11><hda13><hda11>
  now!
  hdc13's event counter: 000000a4
  hdc11's event counter: 000000a4
  hda13's event counter: 000000a4
  hda11's event counter: 000000a4
  md11: max total readahead window set to 124k
  md11: 1 data-disks, max readahead per data-disk: 124k
  md: updating md11 RAID superblock on device
  hdc13 [events: 000000a5](write) hdc13's sb offset: 105728
  hdc11 [events: 000000a5](write) hdc11's sb offset: 105728
  hda13 [events: 000000a5](write) hda13's sb offset: 105728
  hda11 [events: 000000a5](write) hda11's sb offset: 105728
  .
  considering hdc10 ...
    adding hdc10 ...
    adding hda10 ...
  created md10
  bind<hda10,1>
  bind<hdc10,2>
  running: <hdc10><hda10>
  now!
  hdc10's event counter: 00000096
  hda10's event counter: 00000096
  md10: max total readahead window set to 496k
  md10: 2 data-disks, max readahead per data-disk: 248k
  raid0: looking at hda10
  raid0: comparing hda10(30144) with hda10(30144)
  raid0: END
  raid0: ==> UNIQUE
  raid0: 1 zones
  raid0: looking at hdc10
  raid0: comparing hdc10(30144) with hda10(30144)
  raid0: EQUAL
  raid0: FINAL 1 zones
  zone 0
   checking hda10 ... contained as device 0
    (30144) is smallest!.
   checking hdc10 ... contained as device 1
   zone->nb_dev: 2, size: 60288
  current zone offset: 30144
  done.
  raid0 : md_size is 60288 blocks.
  raid0 : conf->smallest->size is 60288 blocks.
  raid0 : nb_zone is 1.
  raid0 : Allocating 8 bytes for hash.
  md: updating md10 RAID superblock on device
  hdc10 [events: 00000097](write) hdc10's sb offset: 30144
  hda10 [events: 00000097](write) hda10's sb offset: 30144
  .
  considering hdc9 ...
[...]

When I don't use the serial console, only the first RAID is well
detected; I get something different: (hand copied)

  md: updating md11 RAID superblock on device
  hdc13 [events: 000000a7](write) hdc13's sb offset: 105728
  hdc11 [events: 000000a7](write) hdc11's sb offset: 105728
  hda13 [events: 000000a7](write) hda13's sb offset: 105728
  hda11 [events: 000000a7](write) hda11's sb offset: 105728
  .
  considering hdc10 ...
  Unable to handle kernel NULL pointer dereference at virtual address 00000000
   printing eip:
  c0127da2
  *pde = 000000
  Oops: 0002
  CPU: 0
  EIP: 0010:[<c0127da2>]
  EFLAGS: 00010093
  eax: 00000000 ebx: cbf4a1e0 ecx: cbff8f40 edx: 00000001
  esi: cbffc638 edi: 00000286 ebp: 00000000 esp: c133df6c
  ds: 0018 es: 0018 ss: 0018
  Process swapper (pid: 1, stackpage=c133d000)
  Stack: cbf1c000 00000001 cbf4b080 cbf4b000 00001000 000bf1c0 c01af03f cbf1c000
         cbf27090 cbf27180 c133dfc4 cbf27180 c01b08b1 cbf23000 cbf23000 0000160d
         00000010 c0105000 0008e000 00160001 cbf27480 c133dfc4 c133dfc4 c133dfc4
  Call Trace: [<c01af03f>] [<c01b08b1>] [<c0105000>] [<c0107007>] [<c01089ff>]
  Code: 89 18 8b 46 04 89 4e 04 89 31 89 41 04 89 08 eb 27 8b 43 08
  kernel panic: Attempted to kill init!

Ksymoops says:

>>EIP; c0127da2 <kfree+7e/c0> <=====
  Trace; c01af03f <sb_equal+7b/94>
  Trace; c01b08b1 <autorun_devices+85/230>
  Trace; c0105000 <empty_bad_page+0/1000>
  Trace; c0107007 <init+7/150>
  Trace; c01089ff <kernel_thread+23/30>
  Code; c0127da2 <kfree+7e/c0> 00000000 <_EIP>:
  Code; c0127da2 <kfree+7e/c0> 0: 89 18 mov %ebx,(%eax) <=====
  Code; c0127da4 <kfree+80/c0> 2: 8b 46 04 mov 0x4(%esi),%eax
  Code; c0127da7 <kfree+83/c0> 5: 89 4e 04 mov %ecx,0x4(%esi)
  Code; c0127daa <kfree+86/c0> 8: 89 31 mov %esi,(%ecx)
  Code; c0127dac <kfree+88/c0> a: 89 41 04 mov %eax,0x4(%ecx)
  Code; c0127daf <kfree+8b/c0> d: 89 08 mov %ecx,(%eax)
  Code; c0127db1 <kfree+8d/c0> f: eb 27 jmp 38 <_EIP+0x38> c0127dda <kfree+b6/c0>
  Code; c0127db3 <kfree+8f/c0> 11: 8b 43 08 mov 0x8(%ebx),%eax
  
  kernel panic: Attempted to kill init!

I hope this is enough to track down the problem.
Please CC to me because I'm not a LKML subscriber.

--
             Roberto Ragusa   robertoragusa at technologist.com

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Thu Dec 07 2000 - 21:00:07 EST