Linux 2.2.2 TCP delays every 41st small packet by 10-20 ms

Josip Loncaric (josip@icase.edu)
Thu, 25 Feb 1999 14:14:32 -0500


This was originally posted to the MPICH and LAM mailing lists, but should
be of interest to Linux kernel and network developers as well. We have a
Beowulf-class cluster of Pentium II nodes where MPI-based codes can run
in parallel.

Josip Loncaric wrote:
>
> William Gropp wrote:
> >
> > At 11:44 AM 2/17/99 -0500, Josip Loncaric wrote:
> > >...
> > >
> > >I suspect that under certain conditions the TCP in Linux kernel 2.0.36
> > >might be delaying some short packets.
> >
> > MPICH sets TCP_NODELAY; this works on all other Unix systems. If it
> > doesn't work as expected under LINUX, that would explain much.
>
> TCP_NODELAY may work as advertised for larger packets, but there are
> interesting things going on with smaller packets.
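
For readers who have not used the option, the sketch below shows what
setting TCP_NODELAY on a socket looks like. It is written in Python for
brevity and is only an illustration: it is not MPICH's code, and the helper
name make_nodelay_socket is made up for this example.

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY asks the stack to send small segments immediately
    # instead of coalescing them while earlier data is still unacknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
# Read the option back to confirm the kernel accepted it (nonzero means set).
print("TCP_NODELAY enabled:", sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
sock.close()
```

Note that the option only affects coalescing on the sending side; it does
not by itself rule out delays introduced elsewhere in the TCP path.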

I tested Linux TCP streaming using a modified version of netpipe-2.3 that
collects timestamps from the Pentium II tick counter. The counter advances
at the CPU clock frequency (400 MHz in our case, i.e. 2.5 ns per tick).
Our systems use NetGear FA310TX cards (some with DEC chips, most with
Lite-On chips) and the latest testing version, 0.90Q, of the tulip.c driver.

The conclusion is this: in Linux TCP, every N-th small packet is delayed
by 1-2 "jiffies" (ticks of the 100 Hz system clock, i.e. 10-20 ms). For
Linux kernel 2.0.36, N=35; for kernels 2.2.1 and 2.2.2, N=41. "Small"
means smaller than K bytes, where K is about 509 in kernel 2.0.36, about
93 in kernel 2.2.1, and about 125 in kernel 2.2.2.
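
A pattern like this can be pulled out of a list of packet timestamps with
very little code. The sketch below is not the modified netpipe code; it is
a minimal Python illustration that runs on a synthetic trace built from
the figures in this post (17 microsecond spacing, one extra jiffy on every
41st packet). The function names and the half-jiffy threshold are
assumptions made for the example.

```python
JIFFY_US = 10_000  # one jiffy at the 100 Hz system clock, in microseconds


def find_stalls(timestamps_us, threshold_us=JIFFY_US / 2):
    """Return (index, gap_us) for each inter-arrival gap above the threshold."""
    stalls = []
    for i in range(1, len(timestamps_us)):
        gap = timestamps_us[i] - timestamps_us[i - 1]
        if gap > threshold_us:
            stalls.append((i, gap))
    return stalls


def stall_period(stalls):
    """Return the spacing N between stalled packets if it is constant, else None."""
    indices = [i for i, _ in stalls]
    spacings = {b - a for a, b in zip(indices, indices[1:])}
    return spacings.pop() if len(spacings) == 1 else None


def synthetic_trace(n_packets, period=41, gap_us=17, stall_us=JIFFY_US):
    """Build a made-up trace: every period-th packet arrives stall_us late."""
    t, trace = 0, []
    for k in range(1, n_packets + 1):
        t += gap_us + (stall_us if k % period == 0 else 0)
        trace.append(t)
    return trace


trace = synthetic_trace(200)  # synthetic data, not a measurement
stalls = find_stalls(trace)
print("stalled packet indices:", [i for i, _ in stalls])
print("period N:", stall_period(stalls))
print("delay in jiffies:", [round(gap / JIFFY_US) for _, gap in stalls])
```

On a real trace the stall lengths would vary between one and two jiffies,
so the threshold should sit well above the normal spacing and below one
jiffy.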

What does this mean? Well, if MPI is streaming small messages (e.g. 16
bytes each) via TCP in the latest Linux kernel 2.2.2, the first 40
messages will be spaced about 17 microseconds apart. Every 41st packet
will be delayed by 10,000 or 20,000 microseconds. For some of our
MPI-based codes, these delayed packets are very, very bad news.
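
To put a number on "very bad news", the sketch below averages the message
interval over one 41-message cycle. It assumes, as a simplification, that
40 of the intervals are 17 microseconds and the remaining one equals the
stall itself (10,000 or 20,000 microseconds); the function name is made up
for this example. Under that assumption the mean interval comes out at
roughly 260 or 504 microseconds, about 15 to 30 times the undisturbed 17.

```python
def mean_interval_us(period, gap_us, stall_us):
    """Mean time per message when one interval in every `period` is a stall."""
    return ((period - 1) * gap_us + stall_us) / period


GAP_US = 17   # normal spacing reported for 16-byte messages on 2.2.2
PERIOD = 41   # every 41st packet is delayed on 2.2.1 and 2.2.2

for jiffies in (1, 2):
    mean = mean_interval_us(PERIOD, GAP_US, jiffies * 10_000)
    print(f"{jiffies} jiffy stall: mean {mean:.1f} us per message, "
          f"{mean / GAP_US:.1f}x slower than {GAP_US} us")
```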

Interestingly enough, larger packets do not suffer from this problem.
Also, spacing small packets 100 microseconds apart on the sending side
does not change the result. There is some reason to suspect that the ACK
logic is to blame: the send/receive/ACK process appears to stall until
something times out 1-2 jiffies later. Whatever the cause, this problem
is still present in the latest Linux kernel, 2.2.2.

I suspect that this accounts for the very uneven MPI performance we see
with mpich-1.1.2. Some of our codes stalled completely. This can happen
both with mpich-1.1.2 and with lam-6.1 using the -c2c flag. The same codes
run fine, though more slowly, using lam-6.1 without the -c2c flag.

Finally, I found it rather curious that a stalled MPI code would
sometimes resume running if we sent a single "ping" to all hosts.

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip@icase.edu
ICASE, Mail Stop 403                        http://www.icase.edu/~josip/
NASA Langley Research Center             mailto:j.loncaric@larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134
