Re: Very poor TCP/SACK performance

David S. Miller (davem@dm.cobaltmicro.com)
Tue, 8 Sep 1998 00:42:13 -0700


Date: Mon, 7 Sep 1998 01:35:28 -0400 (EDT)
From: Jeff DeFouw <mrj@i2k.com>

I noticed two different problems while a friend was receiving a file
from my system. The first one might not be a real problem, I don't know
exactly how SACK resends are handled. The second problem looks like a
definite bug in TCP somewhere.

Thanks for making this report, I'll be looking into it.
But if you provide more dumps to help me debug this problem could
you please rebuild tcpdump with the patch I have appended at the end?
The stock tcpdump does not output SACK information from TCP packets
properly without the patch. The stock tcpdump uses the pre-RFC format
of the SACKS which is nothing like real modern SACKs in use today :-)

General info about the transfer:
207.75.231.233 (2.1.118) receiving a 4.5meg file over FTP from
207.75.226.50 (2.1.120). Both systems are dialups, 207.75.231.233 has a
poor quality connection. Info is from 207.75.226.50's side. I'll refer
to them by the last 16 bits now. No other significant transfers were
taking place at the time of the problems.

Are you sure the tcpdump was run from 207.75.226.50's side? The
behavior of the sender seems to indicate otherwise, I'll show you why
below. Hmmm... having correct tcpdump SACK output would have helped
here but...

23:32:32.941103 207.75.226.50.20 > 207.75.231.233.1075: P 75981:76505(524) ack 0 win 32488 <nop,nop,timestamp 3926083 730030> (DF) [tos 0x8]
23:32:32.941103 207.75.226.50.20 > 207.75.231.233.1075: P 64977:65501(524) ack 0 win 32488 <nop,nop,timestamp 3926084 730030> (DF) [tos 0x8]

Here packet with sequence number 64977 has been retransmitted after
many other (subsequent) frames were sent for the first time.

23:32:33.661103 207.75.231.233.1075 > 207.75.226.50.20: . ack 64977 win 32488 <nop,nop,timestamp 730126 3926083,nop,nop,sack 34356@16889 34880@16889> (DF) [tos 0x8]
23:32:33.871103 207.75.231.233.1075 > 207.75.226.50.20: . ack 64977 win 32488 <nop,nop,timestamp 730148 3926083,nop,nop,sack 34356@16889 35404@16889> (DF) [tos 0x8]
23:32:34.081103 207.75.231.233.1075 > 207.75.226.50.20: . ack 64977 win 32488 <nop,nop,timestamp 730169 3926083,nop,nop,sack 34356@16889 35928@16889> (DF) [tos 0x8]
23:32:34.301103 207.75.231.233.1075 > 207.75.226.50.20: . ack 64977 win 32488 <nop,nop,timestamp 730191 3926083,nop,nop,sack 34356@16889 36452@16889> (DF) [tos 0x8]
23:32:34.831103 207.75.231.233.1075 > 207.75.226.50.20: . ack 64977 win 32488 <nop,nop,timestamp 730244 3926083,nop,nop,sack 34356@16889 36976@16889> (DF) [tos 0x8]
23:32:35.021103 207.75.231.233.1075 > 207.75.226.50.20: . ack 64977 win 32488 <nop,nop,timestamp 730265 3926083,nop,nop,sack 34356@16889 37500@16889> (DF) [tos 0x8]

The other end is receiving the other packets, but not the
retransmitted ones. If the SACK info here was correct we'd be able to
see which ones were received at each ACK. So since it never made it,
we eat a full retransmit timeout, which is several seconds at this
point. This is why the sender is not trying again after all these
acks, fast retransmit and SACK can handle only the first drop not
secondary ones.

23:33:08.261103 207.75.226.50.20 > 207.75.231.233.1075: P 64977:65501(524) ack 0 win 32488 <nop,nop,timestamp 3929640 730265> (DF) [tos 0x8]
23:33:09.231103 207.75.231.233.1075 > 207.75.226.50.20: . ack 65501 win 31964 <nop,nop,timestamp 733685 3929640,nop,nop,sack 34356@16889 37500@16889> (DF) [tos 0x8]

Ok, the timeout retransmits the frame. Note the round trip is around
1 second here.

23:33:09.231103 207.75.226.50.20 > 207.75.231.233.1075: . 65501:66025(524) ack 0 win 32488 <nop,nop,timestamp 3929737 733685> (DF) [tos 0x8]
23:33:10.181103 207.75.231.233.1075 > 207.75.226.50.20: . ack 66025 win 31440 <nop,nop,timestamp 733781 3929737,nop,nop,sack 34356@16889 37500@16889> (DF) [tos 0x8]
23:33:10.181103 207.75.226.50.20 > 207.75.231.233.1075: P 66025:66549(524) ack 0 win 32488 <nop,nop,timestamp 3929832 733781> (DF) [tos 0x8]
23:33:11.121103 207.75.231.233.1075 > 207.75.226.50.20: . ack 66549 win 30916 <nop,nop,timestamp 733875 3929832,nop,nop,sack 34356@16889 37500@16889> (DF) [tos 0x8]
23:33:11.121103 207.75.226.50.20 > 207.75.231.233.1075: . 66549:67073(524) ack 0 win 32488 <nop,nop,timestamp 3929926 733875> (DF) [tos 0x8]
23:33:12.091103 207.75.231.233.1075 > 207.75.226.50.20: . ack 67073 win 30392 <nop,nop,timestamp 733972 3929926,nop,nop,sack 34356@16889 37500@16889> (DF) [tos 0x8]
23:33:12.091103 207.75.226.50.20 > 207.75.231.233.1075: P 67073:67597(524) ack 0 win 32488 <nop,nop,timestamp 3930023 733972> (DF) [tos 0x8]
23:33:13.241103 207.75.231.233.1075 > 207.75.226.50.20: . ack 67597 win 29868 <nop,nop,timestamp 734087 3930023,nop,nop,sack 34356@16889 37500@16889> (DF) [tos 0x8]
23:33:13.241103 207.75.226.50.20 > 207.75.231.233.1075: . 67597:68121(524) ack 0 win 32488 <nop,nop,timestamp 3930138 734087> (DF) [tos 0x8]
23:33:14.191103 207.75.231.233.1075 > 207.75.226.50.20: . ack 68121 win 29344 <nop,nop,timestamp 734182 3930138,nop,nop,sack 34356@16889 37500@16889> (DF) [tos 0x8]
23:33:14.191103 207.75.226.50.20 > 207.75.231.233.1075: P 68121:68645(524) ack 0 win 32488 <nop,nop,timestamp 3930233 734182> (DF) [tos 0x8]
23:33:21.811103 207.75.231.233.1075 > 207.75.226.50.20: . ack 68645 win 28820 <nop,nop,timestamp 734277 3930233,nop,nop,sack 34356@16889 37500@16889> (DF) [tos 0x8]

The sender continues retransmitting the holes in the SACK information,
one by one. It will do this one by one, and this is the desired
behavior until the whole original window at the time of the original
loss event has been acknowledged.

23:33:21.811103 207.75.226.50.20 > 207.75.231.233.1075: . 68645:69169(524) ack 0 win 32488 <nop,nop,timestamp 3930995 734277> (DF) [tos 0x8]
23:33:22.801103 207.75.231.233.1075 > 207.75.226.50.20: . ack 72313 win 26724 <nop,nop,timestamp 735044 3930995> (DF) [tos 0x8]
23:33:22.801103 207.75.226.50.20 > 207.75.231.233.1075: P 72313:72837(524) ack 0 win 32488 <nop,nop,timestamp 3931094 735044> (DF) [tos 0x8]
23:33:24.231103 207.75.231.233.1075 > 207.75.226.50.20: . ack 72837 win 32488 <nop,nop,timestamp 735188 3931094> (DF) [tos 0x8]
23:33:24.231103 207.75.226.50.20 > 207.75.231.233.1075: P 72837:73361(524) ack 0 win 32488 <nop,nop,timestamp 3931237 735188> (DF) [tos 0x8]
23:33:25.671103 207.75.231.233.1075 > 207.75.226.50.20: . ack 73361 win 32488 <nop,nop,timestamp 735331 3931237> (DF) [tos 0x8]
23:33:25.671103 207.75.226.50.20 > 207.75.231.233.1075: . 73361:73885(524) ack 0 win 32488 <nop,nop,timestamp 3931381 735331> (DF) [tos 0x8]
23:33:27.101103 207.75.231.233.1075 > 207.75.226.50.20: . ack 73885 win 32488 <nop,nop,timestamp 735475 3931381> (DF) [tos 0x8]
23:33:27.101103 207.75.226.50.20 > 207.75.231.233.1075: P 73885:74409(524) ack 0 win 32488 <nop,nop,timestamp 3931524 735475> (DF) [tos 0x8]

It's almost done right here. About 3 or 4 packets later it would have
finished off the window by getting 75981 Ack'd. It should have began
to proceed normally at that point.

-----

Second situation:
Transfer was again going ok, but suddenly it dropped off to only one
packet every 2 minutes for almost half an hour! All the packets sent were
being acked, with the main ack counter updating. That can't be right.

True, it sounds really bad.
But note that from the logs I am seeing here, this connection is
shotty enough to obtain a 2 minute retransmit timeout from time to
time due to exponential backoff and the large amounts of loss.

23:35:50.681103 207.75.226.50.20 > 207.75.231.233.1075: . 239469:239993(524) ack 0 win 32488 <nop,nop,timestamp 3945882 749831> (DF) [tos 0x8]
23:35:51.271103 207.75.231.233.1075 > 207.75.226.50.20: . ack 209601 win 30392 <nop,nop,timestamp 749887 3945598,nop,nop,sack 58388@16891 2808@16892 47384@16891[|tcp]> (DF) [tos 0x8]
23:35:51.521103 207.75.231.233.1075 > 207.75.226.50.20: . ack 209601 win 30392 <nop,nop,timestamp 749915 3945623,nop,nop,sack 58388@16891 3332@16892 47384@16891[|tcp]> (DF) [tos 0x8]
23:35:51.791103 207.75.231.233.1075 > 207.75.226.50.20: . ack 210125 win 29868 <nop,nop,timestamp 749941 3945649,nop,nop,sack 58388@16891 3332@16892 47384@16891[|tcp]> (DF) [tos 0x8]
23:35:51.791103 207.75.226.50.20 > 207.75.231.233.1075: P 210125:210649(524) ack 0 win 32488 <nop,nop,timestamp 3945993 749941> (DF) [tos 0x8]

A large window of data went into the network from the sender, and the
receiver is indicating loss. Using the duplicate Acks and the SACK
information from the receiver, the sender enters fast retransmit mode
here at the end and retransmits the lowest unacknowledged frame. Note
how the retransmission occurred at time 23:35:51 at the sender.

23:35:52.011103 207.75.231.233.1075 > 207.75.226.50.20: . ack 210125 win 29868 <nop,nop,timestamp 749962 3945649,nop,nop,sack 58388@16891 3856@16892 47384@16891[|tcp]> (DF) [tos 0x8]

... MANY MANY acks of the same sequence number with SACK updates

23:35:54.571103 207.75.231.233.1075 > 207.75.226.50.20: . ack 212745 win 27248 <nop,nop,timestamp 750219 3945993,nop,nop,sack 58388@16891 8572@16892 47384@16891[|tcp]> (DF) [tos 0x8]

4 seconds later we get the retransmission acknowledged, so we'll try
to get the rest of the window sent out too.

23:35:54.571103 207.75.226.50.20 > 207.75.231.233.1075: . 212745:213269(524) ack 0 win 32488 <nop,nop,timestamp 3946271 750219> (DF) [tos 0x8]
23:36:00.641103 207.75.231.233.1075 > 207.75.226.50.20: . ack 223749 win 16244 <nop,nop,timestamp 750317 3946271,nop,nop,sack 58388@16891 8572@16892> (DF) [tos 0x8]

6 second round trip time, we're probably nearing the retransmit
timeout at this point.

23:36:00.641103 207.75.226.50.20 > 207.75.231.233.1075: P 223749:224273(524) ack 0 win 32488 <nop,nop,timestamp 3946878 750317> (DF) [tos 0x8]
23:36:03.361103 207.75.231.233.1075 > 207.75.226.50.20: . ack 231609 win 19912 <nop,nop,timestamp 751100 3946878,nop,nop,sack 188@16892 8572@16892> (DF) [tos 0x8]

3 second round trip.

23:36:12.811103 207.75.226.50.20 > 207.75.231.233.1075: P 231609:232133(524) ack 0 win 32488 <nop,nop,timestamp 3948095 751100> (DF) [tos 0x8]
23:36:13.791103 207.75.231.233.1075 > 207.75.226.50.20: . ack 232133 win 31440 <nop,nop,timestamp 752143 3948095,nop,nop,sack 712@16892 8572@16892> (DF) [tos 0x8]
23:36:13.791103 207.75.226.50.20 > 207.75.231.233.1075: . 232133:232657(524) ack 0 win 32488 <nop,nop,timestamp 3948193 752143> (DF) [tos 0x8]
23:36:14.751103 207.75.231.233.1075 > 207.75.226.50.20: . ack 232657 win 31440 <nop,nop,timestamp 752239 3948193,nop,nop,sack 712@16892 8572@16892> (DF) [tos 0x8]
23:37:54.111103 207.75.226.50.20 > 207.75.231.233.1075: P 232657:233181(524) ack 0 win 32488 <nop,nop,timestamp 3958225 752239> (DF) [tos 0x8]
23:37:55.081103 207.75.231.233.1075 > 207.75.226.50.20: . ack 233181 win 31440 <nop,nop,timestamp 762273 3958225,nop,nop,sack 712@16892 8572@16892> (DF) [tos 0x8]
23:39:55.081103 207.75.226.50.20 > 207.75.231.233.1075: . 233181:233705(524) ack 0 win 32488 <nop,nop,timestamp 3970322 762273> (DF) [tos 0x8]
23:39:57.261103 207.75.231.233.1075 > 207.75.226.50.20: . ack 233705 win 31440 <nop,nop,timestamp 774370 3970322,nop,nop,sack 712@16892 8572@16892> (DF) [tos 0x8]
23:41:57.261103 207.75.226.50.20 > 207.75.231.233.1075: P 233705:234229(524) ack 0 win 32488 <nop,nop,timestamp 3982540 774370> (DF) [tos 0x8]
23:42:00.541103 207.75.231.233.1075 > 207.75.226.50.20: . ack 234229 win 31440 <nop,nop,timestamp 786588 3982540,nop,nop,sack 712@16892 8572@16892> (DF) [tos 0x8]
23:44:00.541103 207.75.226.50.20 > 207.75.231.233.1075: . 234229:234753(524) ack 0 win 32488 <nop,nop,timestamp 3994868 786588> (DF) [tos 0x8]
23:44:02.811103 207.75.231.233.1075 > 207.75.226.50.20: . ack 234753 win 31440 <nop,nop,timestamp 799050 3994868,nop,nop,sack 712@16892 8572@16892> (DF) [tos 0x8]
23:46:02.811103 207.75.226.50.20 > 207.75.231.233.1075: P 234753:235277(524) ack 0 win 32488 <nop,nop,timestamp 4007095 799050> (DF) [tos 0x8]
23:46:03.871103 207.75.231.233.1075 > 207.75.226.50.20: . ack 235277 win 31440 <nop,nop,timestamp 811156 4007095,nop,nop,sack 712@16892 8572@16892> (DF) [tos 0x8]
23:48:03.871103 207.75.226.50.20 > 207.75.231.233.1075: . 235277:235801(524) ack 0 win 32488 <nop,nop,timestamp 4019201 811156> (DF) [tos 0x8]
23:48:04.931103 207.75.231.233.1075 > 207.75.226.50.20: . ack 235801 win 31440 <nop,nop,timestamp 823262 4019201,nop,nop,sack 712@16892 8572@16892> (DF) [tos 0x8]
23:50:04.931103 207.75.226.50.20 > 207.75.231.233.1075: P 235801:236325(524) ack 0 win 32488 <nop,nop,timestamp 4031307 823262> (DF) [tos 0x8]
23:50:05.881103 207.75.231.233.1075 > 207.75.226.50.20: . ack 236325 win 31440 <nop,nop,timestamp 835359 4031307,nop,nop,sack 712@16892 8572@16892> (DF) [tos 0x8]

Once again, the transfer won't try to get back to full speed until the
sender gets the data frame with sequence number 239469 acknowledged.
Your trace cuts off before that point so I can't tell if it exited
loss recover correctly or not.

I honestly can't see TCP doing anything wrong in these dumps, they are
all explainable. Really, you can't expect much with 6 second or
longer round trip delays and such small windows.

One thing you can try to alleviate this situation is to up the default
window sizes a bit using the knobs in:

/proc/sys/net/core/{r,w}mem_{max,default}

Later,
David S. Miller
davem@dm.cobaltmicro.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/faq.html