Accept() problem

Zachary Williams (admin@ztnet.com)
Wed, 17 Nov 1999 13:29:42 -0500


BUG: In a load-balancing environment that goes through hardware such as a
ServerIron, Apache can be flooded with requests (for example, during a
traffic spike) and will eventually stop responding. The load balancer
then removes that server from 'active' status because it fails to respond
to HTTP health checks. In most cases, only a few of the children ever
receive requests. Because so few children are actually responding to
requests, the load balancer never puts the server back into 'active'
status, so the server stays down until its children are killed (by
killing the parent and restarting Apache; a -HUP WILL NOT WORK!).
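
To make the health-check failure concrete, here is a rough sketch of what such a check might look like. This is only an illustration: the load balancer's real check is not described in this report, and the names http_health_check and stuck_listener, the 2-second timeout, and the "anything below 500 is healthy" rule are all assumptions. The second helper imitates the hung state: the kernel still completes the TCP handshake, but nobody ever calls accept(), so the check times out.

```python
import http.client
import socket


def http_health_check(host, port, timeout=2.0, path="/"):
    """Return True if the server answers an HTTP request within `timeout` seconds.

    Hypothetical check: the status rule (< 500 counts as healthy) is an assumption.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return response.status < 500
    except (OSError, http.client.HTTPException):
        # Refused connections, resets and timeouts all end up here
        # (socket.timeout is a subclass of OSError).
        return False
    finally:
        conn.close()


def stuck_listener():
    """Return a socket that listens but never calls accept().

    Connections are queued in the kernel backlog and never answered,
    which is roughly what the hung children look like from outside.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    sock.listen(5)
    return sock
```

Calling http_health_check("127.0.0.1", port, timeout=0.5) against the port of a stuck_listener() socket should return False after the timeout, even though the connect itself succeeds.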

Info: This is NOT, and I stress, NOT an Apache bug! All children are
healthy. If you have it set up to use the serialized accept, they are
either blocked on the lock, waiting to become the next child to call
accept(15, or (for one child) sitting in accept(15; if not, there is the
thundering herd. Either way, it is affected. The bug appears to be in
the kernel. We have been able to make this problem occur on kernels
2.2.9, 2.2.12, 2.2.13, and 2.2.14pre6. We believe the bug is in the
(buggy) wake-one code that began to be included around kernel 2.2.8
(however, this is just speculation; I'll leave it up to the gurus to
tackle this one. :) ). We have been UNABLE to reproduce this failure on
the latest development kernel, 2.3.28.
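
For readers unfamiliar with the serialized accept mentioned above, the following is a minimal sketch of the pattern, not Apache's actual code (Apache is written in C and uses its own locking). Several forked children share one listening socket, and a child must hold a lock before it may call accept(). The names serve_one and run_demo, the use of flock() as the lock, and the one-connection-per-child behaviour are assumptions made for the illustration.

```python
import fcntl
import os
import socket
import tempfile


def serve_one(listener, lock_file):
    """Child body: take the accept lock, accept one connection, reply with our pid."""
    fcntl.flock(lock_file, fcntl.LOCK_EX)  # only the lock holder may call accept()
    try:
        conn, _ = listener.accept()
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)  # let the next child into accept()
    with conn:
        conn.sendall(str(os.getpid()).encode())


def run_demo(num_children=3):
    """Fork children that share one listener, then connect once per child."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(16)
    port = listener.getsockname()[1]

    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        lock_path = tmp.name

    pids = []
    for _ in range(num_children):
        pid = os.fork()
        if pid == 0:
            try:
                # Each child opens the lock file itself, so every child has its
                # own open file description and flock() excludes its siblings.
                with open(lock_path, "w") as lock_file:
                    serve_one(listener, lock_file)
            finally:
                os._exit(0)
        pids.append(pid)

    served_by = []
    for _ in range(num_children):
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            served_by.append(int(client.recv(64).decode()))

    for pid in pids:
        os.waitpid(pid, 0)
    listener.close()
    os.unlink(lock_path)
    return pids, served_by
```

Without the flock() calls, every child would block in accept() on the same socket at once, which is the thundering-herd case. In both variants the kernel decides which sleeping process to wake for each incoming connection, which is where the suspected wake-one problem would come in.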

This bug is difficult to reproduce; however, in a load-balancing setup
it is very obvious, because the necessary conditions are met (requests
MUST stop going to the affected server; otherwise the children will
respawn and act normally). Single-server users will notice a 'slowdown'
period that lasts anywhere from 30 seconds to a few minutes while the
system kicks back into gear.

I am not sure if I'm properly subscribed to this list, so please CC
admin@ztnet.com with any responses. Thanks.

Zach
