RTO [was Re: my broken TCP is faster on broken networks [Re: Very poor TCP/SACK performance]]

Andrea Arcangeli (andrea@e-mind.com)
Sat, 12 Sep 1998 17:20:56 +0200 (CEST)

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Yiannis Mavroukakis: "Re: Bug in mounting routines of atapi-cdroms"
Previous message: Andrea Arcangeli: "Re: my broken TCP is faster on broken networks"

On Fri, 11 Sep 1998, Savochkin Andrey Vladimirovich wrote:

>The kernel calculates mean deviation as an approximation for
> M | rtt - M rtt |,

OK thanks.

>and the mean RTT. The calculation is coded and supplied by good comments in
>tcp_rtt_estimator() in tcp_input.c.

Good comments if you just know that srtt contains 8*real_srttt and mdev
really is 4*real_mdev ;-).

>Well, I've done the check of the RTO evaluation. I added a debug output

The only whole math algoritm I can read to calculate the RTO is from
rfc793 and that algorithm would compute a very very smaller rto than the
one in the current kernel:

----
An Example Retransmission Timeout Procedure

Measure the elapsed time between sending a data octet with a
particular sequence number and receiving an acknowledgment that
covers that sequence number (segments sent do not have to match
segments received). This measured elapsed time is the Round Trip
Time (RTT). Next compute a Smoothed Round Trip Time (SRTT) as:

SRTT = ( ALPHA * SRTT ) + ((1-ALPHA) * RTT)

and based on this, compute the retransmission timeout (RTO) as:

RTO = min[UBOUND,max[LBOUND,(BETA*SRTT)]]

where UBOUND is an upper bound on the timeout (e.g., 1 minute),
LBOUND is a lower bound on the timeout (e.g., 1 second), ALPHA is
a smoothing factor (e.g., .8 to .9), and BETA is a delay variance
factor (e.g., 1.3 to 2.0).
----

Yes I just read in 1122 that:

--- rfc 1122 ---
DISCUSSION:
There were two known problems with the RTO calculations
specified in RFC-793. First, the accurate measurement
of RTTs is difficult when there are retransmissions.
Second, the algorithm to compute the smoothed round-
trip time is inadequate [TCP:7], because it incorrectly

Internet Engineering Task Force [Page 95]

RFC1122 TRANSPORT LAYER -- TCP October 1989

assumed that the variance in RTT values would be small
and constant. These problems were solved by Karn's and
Jacobson's algorithm, respectively.
-------

But where can I find such Karn's and Jacobson's algorithm? I hope not
having to spend money...

The other algorithm I know to compute the RTO is the one you said in your
previous email in the thread:

------ mail -----
As far as I remember the value of the timeout is set to
round trip time plus its mean deviation multiplied by 4.
------ mail -----

And the kernel compute the rto exactly in this manner but then increase
the rto of (rto/4 + rto >> (varial number that can be 1 at connection
startup)).

With the current code when I start a connection that has a rtt of 143 (and
usually with the unknown snd_cwnd = 2) I get a final rto of 1597. So for a
route with a mean rtt of 1.5 sec that drop many packets, the kernel will
retransmit the packet only after 15 sec.

Since I can' t develop something against RFCs (because the algorithm it' s
not mentioned in the RFC ;-),

(
andrea@dragon:/tmp$ zegrep -il "Karn.*algorithm" /usr/doc/doc-rfc/all-rfcs/*.gz
/usr/doc/doc-rfc/all-rfcs/rfc1055.txt.gz [slip RFC]
/usr/doc/doc-rfc/all-rfcs/rfc1122.txt.gz [see up]
)

I think that it' s really excessive increase the srtt of 4*mdev. Also I
think that it' s completly broken start with a mdev of 3/4 sec (that then
icrease of at least 3 sec the first retransmit timeout).

So I think that we should default to 0 mdev at sock creation time and
caculate the rto as the SRTT + MDEV (not 4*MDEV) changing

tp->rto = (tp->srtt >> 3) + tp->mdev;
to
tp->rto = (tp->srtt >> 3) + (tp->mdev >> 2);

I have not done math to prove that this changes would not collapse the
whole Internet (because I am not able to do that ;-), but sound obvious to
me that a first retransmit timeout of ~15 times > of the first rtt is
wrong.

With my changes if the first rtt is of 143 the first rto become 319 (>2
rtt). This would fit in the RFC793 algorithm and would get the
improvements of the mdev and congestion avoidance calculations.

I append a simple proggy that done the calc for me (#define STOCK 1 to get
the 2.1.121 calculation and STOCK 0 to get _my_ calc).

Andrea[s] Arcangeli

/*
* Little proggy to calc the RTO. STOCK 1 does the kernel 2.1.121 calculation
* STOCK 0 does my own calc.
* Most of the code is taken from Linux Kernel 2.1.121 and this whole program
* is under licence GPL.
* Andrea Arcangeli
*
* Compile with:
* gcc -O2 calc_rto.c -I /usr/src/linux/include/ -D__KERNEL__
*/

#include <net/tcp.h>

#define STOCK 0

#define printk printf

static void tcp_rtt_estimator(struct tcp_opt *tp, __u32 mrtt)
{
long m = mrtt; /* RTT */

printk("tcp_rtt_estimator from rtt %d, srtt %d, mdev %d, "
"rto %d ", mrtt, tp->srtt,
tp->mdev, tp->rto);
/* The following amusing code comes from Jacobson's
* article in SIGCOMM '88. Note that rtt and mdev
* are scaled versions of rtt and mean deviation.
* This is designed to be as fast as possible
* m stands for "measurement".
*
* On a 1990 paper the rto value is changed to:
* RTO = rtt + 4 * mdev
*/
if(m == 0)
m = 1;
if (tp->srtt != 0) {
m -= (tp->srtt >> 3); /* m is now error in rtt est */
tp->srtt += m; /* rtt = 7/8 rtt + 1/8 new */
if (m < 0)
m = -m; /* m is now abs(error) */
m -= (tp->mdev >> 2); /* similar update on mdev */
tp->mdev += m; /* mdev = 3/4 mdev + 1/4 new */
} else {
/* no previous measure. */
tp->srtt = m<<3; /* take the measured time to be rtt */
#if STOCK
tp->mdev = m<<2; /* make sure rto = 3*rtt */
#else
tp->mdev = 0; /* make sure rto = 3*rtt */
#endif
}
printk("to mdev = %d, srtt = %d\n",
tp->mdev, tp->srtt);
}

/* Calculate rto without backoff. This is the second half of Van Jacobson's
* routine referred to above.
*/

/* static __inline__ void tcp_set_rto(struct tcp_opt *tp) */
static void tcp_set_rto(struct tcp_opt *tp)
{
#if STOCK
tp->rto = (tp->srtt >> 3) + tp->mdev;
#else
tp->rto = (tp->srtt >> 3) + (tp->mdev >> 2);
#endif
tp->rto += (tp->rto >> 2) + (tp->rto >> ((tp->snd_cwnd>>TCP_CWND_SHIFT)-1));
printk("tcp_set_rto rto = %d, snd_cwnd %d\n",
tp->rto, tp->snd_cwnd);
}

/* Keep the rto between HZ/5 and 120*HZ. 120*HZ is the upper bound
* on packet lifetime in the internet. We need the HZ/5 lower
* bound to behave correctly against BSD stacks with a fixed
* delayed ack.
* FIXME: It's not entirely clear this lower bound is the best
* way to avoid the problem. Is it possible to drop the lower
* bound and still avoid trouble with BSD stacks? Perhaps
* some modification to the RTO calculation that takes delayed
* ack bias into account? This needs serious thought. -- erics
*/
/* static __inline__ void tcp_bound_rto(struct tcp_opt *tp) */
static void tcp_bound_rto(struct tcp_opt *tp)
{
printk("tcp_bound_rto from rto = %d, ",
tp->rto);
if (tp->rto > 120*HZ)
tp->rto = 120*HZ;
if (tp->rto < HZ/5)
tp->rto = HZ/5;
printk("to rto = %d\n", tp->rto);
}

static void calc(struct tcp_opt *tp, __u32 rtt)
{
tcp_rtt_estimator(tp, rtt);
tcp_set_rto(tp);
tcp_bound_rto(tp);
}

main()
{
int rtt = 142;
struct tcp_opt tp;
tp.srtt = 0;
tp.mdev = 300;
tp.rto = 300;
tp.snd_cwnd = 2;

calc(&tp, rtt);

tp.snd_cwnd = 4;
rtt = 100;

calc(&tp, rtt);

#if 0
for (;;)
calc(&tp, rtt);
#endif
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/faq.html

Next message: Yiannis Mavroukakis: "Re: Bug in mounting routines of atapi-cdroms"
Previous message: Andrea Arcangeli: "Re: my broken TCP is faster on broken networks"