Bug? 2.4.17-rc1 RAID1, SMP, 750MB+ -> hangups...

From: Matti Aarnio (matti.aarnio@zmailer.org)
Date: Thu Dec 27 2001 - 16:45:42 EST


I have had problems with 2.4.17-rc1, but also with
2.4.10-ac1 and a few others in between.

System with problems:
  2.4.17-rc1, 2.4.10
  Dual P-III 750 MHz
  750 MB (or 1024 MB)
  SCSI disks with RAID1 mirroring of all partitions (including swap)
  (Had IDE-disks in previous incarnation, same behaviour.)
  EXT3
  Some 410 000 files in filesystems
  (no cmdline parameters)
    --> hangups

Systems without (apparently) problems:
  2.4.17 (release)
  Dual P-II 266 MHz
  256 MB
  IDE disks with RAID1 mirroring of all partitions (including swap)
  (these disks were previously used in the problem machine)
  EXT3
  Some 130 000 files in filesystems
  (no cmdline parameters)
     --> works

  rh2.4.16-0.9custom
  Dual P-III 800 MHz
  512 MB
  IDE disks with RAID1 mirroring of all partitions (including swap)
  EXT2
  Some 1.2 million files in filesystems
  (no cmdline parameters)
     --> works
  
  2.4.10
  Dual PPro200
  128 MB
  IDE disks with RAID1 mirroring of all partitions (including swap)
  EXT2
  Some 200 000 files in filesystems
  ("noapic" -- machine has faulty MP dataset)
     --> works

Elsewhere, when discussing these problems, I got a comment that the problem
appears on largish-memory SMP machines with RAID1 (mirror)
disks, and that it seems to go away when the "noapic" option is used,
forcing all IO interrupts onto one processor.
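
One way to check whether IO interrupts really are landing on a single
processor is to compare the per-CPU counters in /proc/interrupts with and
without "noapic". The following is only a minimal sketch: it assumes the
usual layout (a header row of CPU names, then one "label: count count ..."
row per interrupt), and the function names are made up for illustration.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: [count per CPU]}."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = rest.split()[:ncpus]
        # Summary rows (e.g. ERR) may carry fewer columns than CPUs; skip them.
        if len(counts) < ncpus or not all(c.isdigit() for c in counts):
            continue
        table[label.strip()] = [int(c) for c in counts]
    return table


def cpus_servicing(table, label):
    """Return indices of CPUs that have handled at least one such interrupt."""
    return [cpu for cpu, n in enumerate(table[label]) if n > 0]
```

On a live system the input would come from
parse_interrupts(open("/proc/interrupts").read()); if the claim above
holds, cpus_servicing() should report a single CPU for the disk
controller's interrupt when booted with "noapic".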

As I understood it, 2.4.9 does not exhibit this problem, but 2.4.10 through
2.4.13 do; the commenter went back to 2.4.9 for production systems.
Apparently the problem persists all the way through to 2.4.17.

The problem shows up as a complete hang of the system when something
is run by cron between 4:00 and 4:20 AM (on a RedHat system: updatedb, and/or
"catman -w", and/or ...)

The keyboard attached to the machine cannot command anything once the hang
has happened. The configured magic-sysrq doesn't work during the hang either.

The problem machine is now being run with: "noapic nosmp nmi_watchdog=1"
which may (or may not) help. Comments I have seen give the impression
that "nosmp" could be removed, as long as "noapic" is present, for
stable operation.
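
To confirm which of these options a running kernel was actually booted
with, the command line can be read back from /proc/cmdline. This is a
rough sketch only; the helper names are hypothetical, and the only option
names taken from this report are "noapic", "nosmp" and "nmi_watchdog".

```python
def boot_options(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    opts = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        opts[name] = value if sep else None
    return opts


def missing_options(cmdline, wanted):
    """Return the names in `wanted` that are absent from the command line."""
    opts = boot_options(cmdline)
    return [name for name in wanted if name not in opts]
```

For example, missing_options(open("/proc/cmdline").read(), ["noapic"])
returns an empty list when "noapic" is in effect, which is a cheap check
to make before and after dropping "nosmp".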

I can't test the problem: the machine, now running in forced
uniprocessor mode, is a mission-criticalish box, and in any case
triggering the problem on demand appears rather difficult -- apart
from it being nearly guaranteed to happen at 4 AM; if not tonight,
then within 2-3 nights anyway.

/Matti Aarnio <matti.aarnio@zmailer.org>


