Re: BUG: Slowdown on 3000 socket-machines tracked down

From: Willy Tarreau
Date: Mon Mar 07 2005 - 00:32:18 EST


On Mon, Mar 07, 2005 at 04:14:37PM +1100, Nick Piggin wrote:

> I think you would have better luck in reproducing this problem if you
> did the full sendfile thing.
>
> I think it is becoming disk bound due to page reclaim problems, which
> is causing the slowdown.
>
> In that case, writing the network only test would help to confirm the
> problem is not a networking one - so not useless by any means.

Not necessarily, Nick. I have written an HTTP testing tool which matches
the description of Ben's: non-blocking, single-threaded, no disk I/O,
etc. It works flawlessly under 2.4, but gives me erratic numbers in 2.6,
especially if I start some CPU activity on the system: I can get pauses
of up to 13 seconds during which the tool does nothing at all! At first I
believed it was because of the scheduler, but it might also be related
to what is described here, since I had somewhat the same setup (gigE, 1500,
thousands of sockets). I never had enough time to investigate further, so I
went back to 2.4.
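
For anyone who wants to look for this kind of pause, here is a minimal
sketch of how stalls in a single-threaded, non-blocking loop could be
measured. It is not the HTTP tool mentioned above: the names find_stalls
and run_echo_loop, the local socket pair standing in for real network
traffic, and the thresholds are all assumptions made for illustration.

```python
import selectors
import socket
import time


def find_stalls(timestamps, threshold):
    """Return (start, gap) for each gap between consecutive
    timestamps that is longer than `threshold` seconds."""
    return [
        (prev, cur - prev)
        for prev, cur in zip(timestamps, timestamps[1:])
        if cur - prev > threshold
    ]


def run_echo_loop(duration=1.0, tick=0.01):
    """Single-threaded, non-blocking loop over a local socket pair.

    No disk I/O is involved. Returns the monotonic timestamp taken
    at the start of every loop iteration.
    """
    a, b = socket.socketpair()
    a.setblocking(False)
    b.setblocking(False)
    sel = selectors.DefaultSelector()
    sel.register(b, selectors.EVENT_READ)
    stamps = []
    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            stamps.append(time.monotonic())
            a.send(b"x")  # one byte per iteration, drained just below
            for key, _ in sel.select(timeout=tick):
                key.fileobj.recv(4096)
    finally:
        sel.close()
        a.close()
        b.close()
    return stamps
```

Running find_stalls(run_echo_loop(duration=10.0), 0.5) while some CPU
load is started on the same machine would list every moment the loop
was not scheduled for more than half a second. A monotonic clock is
used so that wall-clock adjustments are not mistaken for pauses.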

It makes me think that, for the problem described here, we have no
indication of the CPU and I/O activity on the machine. Knowing it might
help Ben reproduce the problem.

Cheers,
Willy
