Re: [2.2.13aa6 (bugfix release II) ]

ursus (ursus@usa.net)
20 Dec 99 14:51:32 EST


In newsgroup fa.linux.kernel, Andrea Arcangeli wrote:

> Date: Fri, 17 Dec 1999 16:34:21 +0100 (CET)
> From: Andrea Arcangeli <andrea@suse.de>
> Subject: 2.2.13aa6 (bugfix release II)
>
> [...]
> The main features of 2.2.13aa6 are:
>
> o Support for 4Gigabyte of RAM (me and Gerhard.Wichert)
> o Improved VM for high end machines with enough ram and doing
> heavy I/O under high memory pressure (me)
> o RAW-IO (also on bigmem) (Stephen C. Tweedie)
>
> o updated with all showstopper/necessary bugfixes discovered into
> the 2.2.x kernels over the time.
>

Andrea:

Thanks for the updated 2.2.13aa6 patchset, especially since
it works cleanly with the raid-0.90 patches! I've been using
Alan Cox's 2.2.13ac3 patches for the raid-0.90 support,
but really wanted to run with your SMP scheduling changes,
since they would seem to help performance/stability with
my application (high-load webserver on dual-PIII machine).
Also I was getting errors regarding "Out of memory" which
you have a couple of patches for in aa6 ...

I upgraded a cluster of servers (Compaq 6400R, 2 x PIII-500)
from 2.2.13ac3 to 2.2.13aa6+raid-0.90 (plus the incremental
"set_blocksize" patch you kindly provided) and Don Becker's
eepro.c 1.09l (not sure if this is the latest?), hoping to
finally have a really stable setup. The servers had been
running well for about 12 hours, but one of them just
crashed with the following error (seen before under 2.2.13ac3):

wait_on_bh, CPU 3: (this is the first processor)
irq: 0 [0 0]
bh: 1 [0 0]
<[8010b39d]> <[80150daa]> <[80150d46]> <[8012912b]> \
<[8012a367]> <[801291a6]> <[8012921f]> <[801092ac]>

I tried to correlate the call-trace addresses above with System.map:

8010b360 T synchronize_bh
8010b3b0 T synchronize_irq

80150d20 t sock_close
80150d5c t sock_fasync

8012910c T __fput
80129154 T filp_close

8012a350 T fput
8012a398 T put_filp

80129154 T filp_close
801291b0 T sys_close

801291b0 T sys_close
80129238 T sys_vhangup

80109278 T system_call
801092b0 T ret_from_sys_call

If I press ALT+SysRq+P, the EIP shows "0010:[<80166671>]",
which appears to fall inside tcp_send_delayed_ack (from System.map):

80166660 T tcp_send_delayed_ack
801666b4 T tcp_send_ack
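
The lookup done by hand above (find the last System.map symbol
whose address is at or below each trace address) can be sketched
as a small script. This is only an illustration: the helper names
load_symbols and resolve are made up for this example, and the
symbol table is just the excerpt quoted in this message, so a
result is only as good as that excerpt (a symbol missing from it
could change the answer).

```python
import bisect

# Excerpt of System.map as quoted in this message (not the full file).
SYSTEM_MAP_EXCERPT = """\
8010b360 T synchronize_bh
8010b3b0 T synchronize_irq
80150d20 t sock_close
80150d5c t sock_fasync
8012910c T __fput
80129154 T filp_close
8012a350 T fput
8012a398 T put_filp
801291b0 T sys_close
80129238 T sys_vhangup
80109278 T system_call
801092b0 T ret_from_sys_call
80166660 T tcp_send_delayed_ack
801666b4 T tcp_send_ack
"""

# Addresses from the wait_on_bh trace, plus the EIP from ALT+SysRq+P.
TRACE = [0x8010b39d, 0x80150daa, 0x80150d46, 0x8012912b,
         0x8012a367, 0x801291a6, 0x8012921f, 0x801092ac]
EIP = 0x80166671


def load_symbols(text):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return (name, offset) for the last symbol at or below addr, else None."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return name, addr - start


symbols = load_symbols(SYSTEM_MAP_EXCERPT)
for addr in TRACE + [EIP]:
    hit = resolve(symbols, addr)
    if hit is None:
        print("%08x  (below first known symbol)" % addr)
    else:
        print("%08x  %s+0x%x" % (addr, hit[0], hit[1]))
```

Against a real machine, the same lookup would be run on the full
System.map that matches the running kernel rather than on an
excerpt.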

In some earlier posts I read that "wait_on_bh"
means that the system is waiting on the bottom half
(SMP-specific), so I've edited my /etc/lilo.conf
to add "nosmp noapic", and I'll see if the servers
run stable w/o SMP ... this isn't a real solution
of course.

Any help/pointers/patches would be greatly appreciated.
In an earlier post I mentioned this is part of a larger
project to upgrade about 100 webservers based on 2.0.36
kernel to 2.2.13+ ... the overall load is 1 billion hits
per day currently. This would be yet another testament
to Linux's viability in the enterprise environment,
assuming I can nail down this SMP problem :)

PS: in your directory on the ftp.*.kernel.org mirrors,
I see a patch regarding bh_latency for 2.2.14pre;
does this address the above "wait_on_bh" problem?

Thanks in advance

--
ursus@usa.net

