Please Help. TCP/IP tuning problem.

From: Peter Zaitsev (pz@spylog.ru)
Date: Mon Apr 17 2000 - 04:06:07 EST


I'm writing an application which stands like HTTP server. It serves now
about 120 request per second, from different connections (because the
clients are different)

Resently I worte a problem about connect slowdown, thanks to everybody it
was fixed by increasing tcp_max_syn_backlog. But now I'm having other quite
strange problem.

The program is writing like multi-threaded application there several threads
serving up to N sockets each in non-blocking mode. All threads listening
the same socket and the first ready thread assepts request for service.
Everything worked very nice before the load become notable - more than 100
requests per second.
Now the problem is quite strange and I can't understand why it's happening.
Usually I have 20-40 of open non-blocking sockets but sometimes with no
load or CPU usage rise this number rises up to maximum, what ever it is 128
or 1024, and nether comes back where the CPU usage still low.

The investigation shows somehow open sockets does not become ready for
reading or writing in a very long time - select waiting for 8 sockets may
not timeout in 20 seconds or so on while then system works normally this
number is much lower - the socket becomes readable/writeble in less then a
second in avarege - at least I do not see any selects with active
connections waiting for long time. Here is the status output from my
program:

Real accepts per 258 sec period: 28109 that is 108.949612 accepts/sec
Listen #0 is : Waiting in select(Active connections:1 (of 3)) for 0.008510
Listen #1 is : Waiting in select(Active connections:0 (of 3)) for 0.066529
Listen #2 is : Waiting in select(Active connections:0 (of 3)) for 1.538298
Listen #3 is : Waiting in select(Active connections:0 (of 3)) for 1.587188
Listen #4 is : Waiting in select(Active connections:1 (of 3)) for 0.043983
Listen #5 is : Waiting in select(Active connections:0 (of 3)) for 3.431778
Listen #6 is : Waiting in select(Active connections:0 (of 3)) for 0.964423
Listen #7 is : Waiting in select(Active connections:0 (of 3)) for 0.247505
Listen #8 is : Waiting in select(Active connections:1 (of 3)) for 0.426755
Listen #9 is : Waiting in select(Active connections:1 (of 3)) for 0.164025
Listen #10 is : Waiting in select(Active connections:0 (of 3)) for 0.268437
Listen #11 is : Waiting in select(Active connections:0 (of 3)) for 0.372799
Listen #12 is : Waiting in select(Active connections:0 (of 3)) for 0.447565
Listen #13 is : Waiting in select(Active connections:0 (of 3)) for 0.710461
Listen #14 is : Waiting in select(Active connections:0 (of 3)) for 0.607864
Listen #15 is : Waiting in select(Active connections:1 (of 3)) for 0.000108
Listen #16 is : Waiting in select(Active connections:1 (of 3)) for 0.460155
Listen #17 is : Waiting in select(Active connections:1 (of 3)) for 0.180504
Listen #18 is : Waiting in select(Active connections:0 (of 3)) for 0.714812
Listen #19 is : Waiting in select(Active connections:0 (of 3)) for 0.923238
Listen #20 is : Waiting in select(Active connections:0 (of 3)) for 0.053869
Listen #21 is : Waiting in select(Active connections:0 (of 3)) for 0.360183
Listen #22 is : Waiting in select(Active connections:1 (of 3)) for 0.350272
Listen #23 is : Waiting in select(Active connections:0 (of 3)) for 1.750227
Listen #24 is : Waiting in select(Active connections:0 (of 3)) for 1.159869
Listen #25 is : Waiting in select(Active connections:0 (of 3)) for 2.517232
Listen #26 is : Waiting in select(Active connections:0 (of 3)) for 0.586975
Listen #27 is : Waiting in select(Active connections:1 (of 3)) for 0.043037
Listen #28 is : Waiting in select(Active connections:0 (of 3)) for 6.051817
Listen #29 is : Waiting in select(Active connections:0 (of 3)) for 0.293510
Listen #30 is : Waiting in select(Active connections:0 (of 3)) for 1.192404
Listen #31 is : Waiting in select(Active connections:0 (of 3)) for 0.672941
Listen #32 is : Waiting in select(Active connections:0 (of 3)) for 0.374613
Listen #33 is : Waiting in select(Active connections:0 (of 3)) for 0.165230
Listen #34 is : Waiting in select(Active connections:1 (of 3)) for 0.312375
Listen #35 is : Waiting in select(Active connections:0 (of 3)) for 0.311362
Listen #36 is : Waiting in select(Active connections:0 (of 3)) for 0.544789
Listen #37 is : Waiting in select(Active connections:0 (of 3)) for 0.270548
Listen #38 is : Waiting in select(Active connections:0 (of 3)) for 0.044832
Listen #39 is : Waiting in select(Active connections:0 (of 3)) for 0.879379
Listen #40 is : Waiting in select(Active connections:1 (of 3)) for 0.378522
Listen #41 is : Waiting in select(Active connections:2 (of 3)) for 0.072031
Listen #42 is : Waiting in select(Active connections:1 (of 3)) for 0.455245
Listen #43 is : Waiting in select(Active connections:0 (of 3)) for 2.401986
Listen #44 is : Waiting in select(Active connections:0 (of 3)) for 0.067685
Listen #45 is : Waiting in select(Active connections:0 (of 3)) for 1.180331
Listen #46 is : Waiting in select(Active connections:0 (of 3)) for 0.204028
Listen #47 is : Waiting in select(Active connections:0 (of 3)) for 2.240351
Listen #48 is : Waiting in select(Active connections:1 (of 3)) for 0.547241
Listen #49 is : Waiting in select(Active connections:0 (of 3)) for 2.344869
Listen #50 is : Waiting in select(Active connections:0 (of 3)) for 0.067559
Listen #51 is : Waiting in select(Active connections:0 (of 3)) for 0.179676
Listen #52 is : Waiting in select(Active connections:0 (of 3)) for 0.167548
Listen #53 is : Waiting in select(Active connections:1 (of 3)) for 2.826066
Listen #54 is : Waiting in select(Active connections:1 (of 3)) for 0.533137
Listen #55 is : Waiting in select(Active connections:2 (of 3)) for 0.024815
Listen #56 is : Waiting in select(Active connections:0 (of 3)) for 0.907580
Listen #57 is : Waiting in select(Active connections:0 (of 3)) for 0.734941
Listen #58 is : Waiting in select(Active connections:1 (of 3)) for 0.003702
Listen #59 is : Waiting in select(Active connections:0 (of 3)) for 2.341024
Listen #60 is : Waiting in select(Active connections:0 (of 3)) for 0.024904
Listen #61 is : Waiting in select(Active connections:0 (of 3)) for 0.021143
Listen #62 is : Waiting in select(Active connections:0 (of 3)) for 1.063107
Listen #63 is : Cnt Checking Was Hit for -0.000904
Total active connections in listen:21 (of 192))

Last row is a time measured in seconds.

Now the small strange things I found -
The grater number of threads I have the smaller chance is this lockup will
hit me. The lowest system load I get with only 2-3 threads servecing on my
2.2.14SMP system. but I have to use many threads with small number of
sockets for each to have this problem to hit me less often.

Now the most interesting thing is how this lockup can be resolved - then
this "lockup happens" system can serve only really small number of requests
because it does not have much sockets to make accept in. The situation as I
wrote does not resove itself in a time but the following shure help:

echo 5 > /proc/sys/ipv4/tcp_max_sys_backlog

Wait a bit so the number of requests will fall and the the number of sockets
will drop even less then to normal level. Then I do this the socket_ready
time returns to normal

echo 512 > /proc/sys/ipv4/tcp_max_syn_backlog

After I have done this the system becomes able to work in normal way again -
I the number of connections/sec raises to normal level. The socket_ready
time returns to normal.

I was unable to repeat this problem localy - then benchmarking even from
number of client machines many connections from each system is able to
handle about 1000 connects(requests)/sec but not in a real world.

Does anyone has any Ideas why may this happen and how to resolve this ?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sun Apr 23 2000 - 21:00:10 EST