Re: stability problems with 2.4.24/Software RAID/ext3

From: Marcelo Tosatti
Date: Thu Jan 08 2004 - 12:18:53 EST




On Thu, 8 Jan 2004, martin f krafft wrote:

> Hi all,
>
> I operate a groupware server which is giving me a very hard time.
> It's an AMD Athlon XP 3000+ with 1 GB of RAM, and four 7200 RPM IDE
> hard drives, two attached to the primary channels of the onboard
> controller, and two to the primary channels of a Promise 20269 EIDE
> controller. The kernel is a 2.4.24 with the configuration I placed
> here:
> ftp://ftp.madduck.net/scratch/config-2.4.24-gaia.gz
>
> The system is configured with seven software RAID arrays and three swap partitions:
>
> md1: /boot (ext3) RAID 1 spanning hda1 and hdc1
> md5: / (ext3) RAID 5 hda5/hdc5/hde5 with hdg5 as a spare
> md6: /usr (ext3) RAID 5 hda6/hdc6/hde6 with hdg6 as a spare
> md7: /var (ext3) RAID 5 hda7/hdc7/hde7 with hdg7 as a spare
> md8: /usr/local (ext3) RAID 5 hda8/hdc8/hde8 with hdg8 as a spare
> md9: /home (ext3) RAID 5 hda9/hdc9/hde9 with hdg9 as a spare
> md10: /tmp (ext3) RAID 5 hda10/hdc10/hde10 with hdg10 as a spare
>
> hda2 holds a non-RAID rescue system with RAID 1/5 support.
>
> hdc2, hde2, hdg2 are swap partitions of 256 MB each.
>
> hde1 and hdg1 are unused.
>
> The individual hard disks are identically tweaked with hdparm:
>
> hdparm -A1 -B255 -c1 -d1 -p -u0 -W0 -Xudma6 /dev/hd{a,c,e,g}
>
> See the end of this mail for details.
>
> The system experiences severe stability problems, which I relate to
> the filesystem, RAID, or controller code, because it's reproducible
> with excessive disk operations. E.g., doing something like
>
> rsync -a --exclude /tmp / /tmp/dump

<snip>

> Interestingly, just now, the machine crashed differently. `vmstat 1`
> was still running, but new processes could not be started, after the
> kernel reported a lot of oopses in user-space processes (e.g. rsync,
> top, zsh), as well as some of the kjournald oopses like above.
> I have included the footprint of the user-space program oopses
> further down. `vmstat 1` was happily printing the following away,
> when the system was already unusable. The b > 127 value is
> interesting, as it has been continuously increasing (well, in
> a non-decreasing way) after a certain point, and somewhere on the
> way, the system reached the state of agnosia.
>
> 0 127 2 16124 10304 43004 682188 0 0 0 0 109 7 0 0 100
> 0 127 2 16124 10304 43004 682188 0 0 0 0 111 5 0 0 100
> 0 127 2 16124 10304 43004 682188 0 0 0 0 114 9 0 0 100
> 0 127 2 16124 10304 43004 682188 0 0 0 0 111 5 0 0 100
> 0 127 2 16124 10304 43004 682188 0 0 0 0 115 9 0 0 100
> 0 128 2 16124 10420 43004 682060 0 0 0 0 119 12 0 0 100
> 0 128 2 16124 10420 43004 682060 0 0 0 0 122 11 0 0 100
>
> Apart from these panics and hangups, the system also randomly issues
> segfaults to processes, or reports a kernel oops. These take the
> following form:

Hi Martin,

I can't help you much, but I believe your problem might be related to
faulty hardware. Have you checked whether the memory is OK?

Also, try disabling DMA on the drives attached to the Promise controller.
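
A minimal sketch of that DMA test, assuming hde and hdg are the two
drives on the Promise controller (the onboard controller normally claims
hda through hdd, so check dmesg first). `-d0` is the hdparm switch that
turns DMA off for a drive. The DRY_RUN variable and the disable_dma
function are made up for illustration; by default the script only prints
the commands it would run.

```shell
#!/bin/sh
# Sketch: turn off DMA on the drives attached to the Promise controller.
# Assumption: hde and hdg are the Promise-attached drives; verify this
# against the boot messages before running anything for real.
PROMISE_DRIVES="/dev/hde /dev/hdg"

# DRY_RUN=1 (the default) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

disable_dma() {
    for drive in $PROMISE_DRIVES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "hdparm -d0 $drive"
        else
            # -d0 disables the using_dma flag for this drive
            hdparm -d0 "$drive"
        fi
    done
}

disable_dma
```

Run it as root with DRY_RUN=0 to apply the change, then repeat the rsync
test. If the oopses stop, the Promise DMA path is the likely suspect; if
they continue, a memory test (memtest86, for instance) would be the next
thing to try.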