my broken TCP is faster on broken networks [Re: Very poor TCP/SACK performance]

Andrea Arcangeli (andrea@e-mind.com)
Thu, 10 Sep 1998 03:00:45 +0200 (CEST)


On 9 Sep 1998, Mark Gray wrote:

>OpenBSD uses a window of 16384 and I find that that gives me a very
>good transfer rate with linux-2.0.* without ever causing stalls which
>will occur off and on when using the default Window.

I can reproduce the stall very reliably over ppp. The problem is that some
networks between here and my University are really congested. I think so
because they drop many TCP packets (or deliver them badly out of
order, many seconds late).

The _only_ way to increase performance is to decrease tp->rto. I tried
forcing the kernel to retransmit packets very fast (certainly in less time
than the rtt, hehehe ;-) and my connection to the University is now
perfectly responsive (cool ;-). I don't stall anymore. Before my hack to
tcp_reset_xmit_timer() I had to wait 10-15 sec before ssh asked me for the
password. Now I get the password prompt after a few sec (and my modem's RX
light is always ON ;-). Interactive performance is perfect now with my
brute hack.

This is a tcpdump before the hack:

15:55:58.930956 195.223.140.68.1023 > 130.136.3.110.ssh: S 2050156785:2050156785(0) win 32767 <mss 1460,sackOK,timestamp 1045254 0,nop,wscale 4> (DF)
15:56:01.923811 195.223.140.68.1023 > 130.136.3.110.ssh: S 2050156785:2050156785(0) win 32767 <mss 1460,sackOK,timestamp 1045554 0,nop,wscale 4> (DF)
15:56:07.924031 195.223.140.68.1023 > 130.136.3.110.ssh: S 2050156785:2050156785(0) win 32767 <mss 1460,sackOK,timestamp 1046154 0,nop,wscale 4> (DF)
15:56:08.894043 130.136.3.110.ssh > 195.223.140.68.1023: S 1765296980:1765296980(0) ack 2050156786 win 16352 <mss 1460>

I had to wait 10 sec to get the SYN-ACK from the server!!!! The bad thing
about my connection is that the packets dropped most often seem to be small
ones (like the SYN-ACK or the small pushed packets of interactive
shell use). TCP used to transfer web pages and ftp files does much better.

15:56:08.894133 195.223.140.68.1023 > 130.136.3.110.ssh: . ack 1 win 65160 (DF)
15:56:11.924153 130.136.3.110.ssh > 195.223.140.68.1023: S 1765296980:1765296980(0) ack 2050156786 win 16352 <mss 1460>

This is the SYN-ACK for the second SYN, and it arrived out of order. It
seems that sometimes the connection is responsive, sometimes it is slow and
sometimes it drops everything.

15:56:11.924224 195.223.140.68.1023 > 130.136.3.110.ssh: . ack 1 win 65160 (DF) [tos 0x10]
15:56:12.484182 130.136.3.110.ssh > 195.223.140.68.1023: P 1:16(15) ack 1 win 16352 (DF) [tos 0x10]
15:56:12.485033 195.223.140.68.1023 > 130.136.3.110.ssh: P 1:16(15) ack 16 win 65160 (DF) [tos 0x10]
15:56:13.124208 130.136.3.110.ssh > 195.223.140.68.1023: P 16:292(276) ack 16 win 16352 (DF) [tos 0x10]
15:56:13.136426 195.223.140.68.1023 > 130.136.3.110.ssh: P 16:172(156) ack 292 win 65160 (DF) [tos 0x10]
15:56:16.164325 130.136.3.110.ssh > 195.223.140.68.1023: P 16:292(276) ack 16 win 16352 (DF) [tos 0x10]
15:56:16.164390 195.223.140.68.1023 > 130.136.3.110.ssh: . ack 292 win 65160 (DF) [tos 0x10]

Here the TCP of 2.1.120 waits 4 sec to retransmit 16:172. Why should it
wait 4 sec if the normal rtt (measured for example between the sending of
1:16 and its in-order ack) is 13.124208 - 12.485033 = 0.64 sec?

As far as I understand from the code, rto has been set to 4 sec, but rto
should instead be close to that rtt (roughly half a second)! I have not
read the RFCs very carefully (they describe an algorithm for calculating
rto) but my intuition tells me that, to be fast and efficient on every kind
of network, TCP has to retransmit unacked packets shortly after the real
rtt of the route has elapsed (at least without sack).

15:56:20.334521 195.223.140.68.1023 > 130.136.3.110.ssh: P 16:172(156) ack 292 win 65160 (DF) [tos 0x10]
15:56:21.324529 130.136.3.110.ssh > 195.223.140.68.1023: . ack 172 win 16352 (DF) [tos 0x10]
15:56:21.484539 130.136.3.110.ssh > 195.223.140.68.1023: P 292:304(12) ack 172 win 16352 (DF) [tos 0x10]
15:56:21.485785 195.223.140.68.1023 > 130.136.3.110.ssh: P 172:200(28) ack 304 win 65160 (DF) [tos 0x10]
15:56:22.164545 130.136.3.110.ssh > 195.223.140.68.1023: . ack 200 win 16352 (DF) [tos 0x10]
15:56:22.991633 195.223.140.68.1023 > 130.136.3.110.ssh: F 200:200(0) ack 304 win 65160 (DF) [tos 0x10]
15:56:23.194593 130.136.3.110.ssh > 195.223.140.68.1023: P 304:316(12) ack 200 win 16352 (DF) [tos 0x10]
15:56:23.194699 195.223.140.68.1023 > 130.136.3.110.ssh: R 2050156985:2050156985(0) win 0 [tos 0x10]
15:56:23.574608 130.136.3.110.ssh > 195.223.140.68.1023: . ack 201 win 16351 (DF) [tos 0x10]
15:56:23.574671 195.223.140.68.1023 > 130.136.3.110.ssh: R 2050156986:2050156986(0) win 0 [tos 0x10]
15:56:23.584616 130.136.3.110.ssh > 195.223.140.68.1023: F 316:316(0) ack 201 win 16352 [tos 0x10]
15:56:23.584654 195.223.140.68.1023 > 130.136.3.110.ssh: R 2050156986:2050156986(0) win 0 [tos 0x10]
15:56:26.324702 130.136.3.110.ssh > 195.223.140.68.1023: P 304:316(12) ack 201 win 16352 (DF) [tos 0x10]
15:56:26.324769 195.223.140.68.1023 > 130.136.3.110.ssh: R 2050156986:2050156986(0) win 0 [tos 0x10]

So I broke TCP, and now I retransmit many duplicate packets even
before the rtt has passed, just because I am sure many of them will be
dropped by the evil network ;-).

If everybody used my idea, this hack would make already collapsed networks
collapse even more. It works great as long as I am the only guy out there
who retransmits every packet many times without caring about the rtt
;-).

16:23:06.123806 195.223.140.63.1023 > 130.136.3.110.ssh: S 3758917761:3758917761(0) win 32120 <mss 1460,sackOK,timestamp 8113 0,nop,wscale 0> (DF)
16:23:06.221287 195.223.140.63.1023 > 130.136.3.110.ssh: S 3758917761:3758917761(0) win 32120 <mss 1460,sackOK,timestamp 8123 0,nop,wscale 0> (DF)
16:23:06.421291 195.223.140.63.1023 > 130.136.3.110.ssh: S 3758917761:3758917761(0) win 32120 <mss 1460,sackOK,timestamp 8143 0,nop,wscale 0> (DF)
16:23:06.821322 195.223.140.63.1023 > 130.136.3.110.ssh: S 3758917761:3758917761(0) win 32120 <mss 1460,sackOK,timestamp 8183 0,nop,wscale 0> (DF)
16:23:07.621354 195.223.140.63.1023 > 130.136.3.110.ssh: S 3758917761:3758917761(0) win 32120 <mss 1460,sackOK,timestamp 8263 0,nop,wscale 0> (DF)
16:23:07.731330 130.136.3.110.ssh > 195.223.140.63.1023: S 893569343:893569343(0) ack 3758917762 win 16352 <mss 1460>
16:23:07.731438 195.223.140.63.1023 > 130.136.3.110.ssh: . ack 1 win 32120 (DF)

This is the tcpdump of my first TCP hack. You can see that by _only_
changing TCP_TIMEOUT_INIT from 3 sec to 100 msec (without touching the
tcp_reset_xmit_timer() routine) I get the SYN-ACK after about 1.6 sec
instead of 10 sec. Roughly a 6x improvement! Great!

Is there a way to do something like my hack without breaking TCP? Is this
the part of RFC1122 that prevents us from using an initial rto lower than
3 sec?

--- rfc1122 line 5626 ---
The following values SHOULD be used to initialize the
estimation parameters for a new connection:

(a) RTT = 0 seconds.

(b) RTO = 3 seconds. (The smoothed variance is to be
initialized to the value that will result in this RTO).
--- rfc1122 ---

Could we add a flag that can be enabled on some routes (something like
firewall settings) and that, if set, would break TCP and avoid stalls by
retransmitting things many times without waiting for a possibly long rto?
This new flag, if set on a reliable network connection, would decrease
performance, so it would not be used on good networks. It's like saying
"if you have a congested network and you don't care, don't worry: linux
users will be the only ones who run fast on it, congesting it
even more".

I am also suspicious of the rto calculation since, if I understand things
right (intuitively, too), rto should be as close as possible to the
real rtt of the route. For example, at my University no machines use the
TCP timestamps option (yes yes, I really read a lot of RFCs today ;-) so
the 2.1.120 TCP code should calculate the rtt from packets acked in
order. When the network works (no dropped or out-of-order
packets) the rtt is at worst one second. Since we retransmit after
tp->rto, I think that the rto should be close to the real rtt of the route,
and it isn't, since my stalls are longer than one second (once the
connection is ESTABLISHED of course; I know that the SYN retransmit has to
happen after 3 sec (and btw it happens after 3sec... 6sec... and not
after 3sec...3sec...3sec...)).

I have just added some debugging code to the rto calculation. Tomorrow I'll
boot the rto-debug kernel and run some tests (the only annoying thing
is that for every test on the evil route from here to my University I
have to pay for the phone call, because I can't reproduce the dropped
packets locally... :-().

Andrea[s] Arcangeli

PS. I hope I have not caused too much confusion...
