[PATCH] Re: Very poor TCP/SACK performance

Vijay G. Bharadwaj (vgb@isr.umd.edu)
Tue, 8 Sep 1998 18:34:40 -0400 (EDT)


On Tue, 8 Sep 1998, David S. Miller wrote:

> Date: Mon, 7 Sep 1998 01:35:28 -0400 (EDT)
> From: Jeff DeFouw <mrj@i2k.com>
>
> I noticed two different problems while a friend was receiving a file
> from my system. The first one might not be a real problem, I don't know
> exactly how SACK resends are handled. The second problem looks like a
> definite bug in TCP somewhere.
>
> Thanks for making this report, I'll be looking into it.
> But if you provide more dumps to help me debug this problem could
> you please rebuild tcpdump with the patch I have appended at the end?
> The stock tcpdump does not output SACK information from TCP packets
> properly without the patch. The stock tcpdump uses the pre-RFC format
> of the SACKS which is nothing like real modern SACKs in use today :-)

I saw this sort of problem in my work over the summer, and I fixed it as
well. It's a couple of one-liners. I found three problems which caused bad
performance (forgive me, I haven't looked at this trace close enough to
tell if both problems are here). I've been planning to send this in for a
couple of days, I just haven't had the time...

The patch is at the end. It fixes three things in tcp_input.c. In order:

Problem 1: Incoming TCP SACK blocks handled incorrectly

Incoming SACK info was almost never being used, and at most one SACK block
was used. The check for short-circuiting the loop was wrong. The old check
broke the loop without marking any packets as SACKed unless the SACK block
started somewhere in the first skb on the retransmit queue. This almost
never happens because if the first skb on the queue was being ACKed, why
would it be in a SACK block? I changed the check to break only if we have
marked the last full skb that was SACKed by this block, which I believe
was the intention.

Problem 2: dupacks not counted correctly

The code for counting dupacks was wrong. If SACK was being used, then the
first dupack would cause dup_acks to be incremented by 2 (if fackets_out
was greater than 3, which happens if a burst of packets is lost, or if an
ACK packet is lost). Also, high_seq would be set in this case, preventing
further dupacks from incrementing the dup_ack counter. In the old code,
fackets_out was never set anyways because of problem 1, so this never
happened.

Problem 3: Zero window probe handling broken

The RFC (1122, IIRC) says that zero window probes should contain one byte
of garbage data with a stale sequence number. However, both Solaris and
Windows send a zero-window probe consisting of one byte of new data. Since
pred_flags match up, Linux happily accepts this byte. Now
tcp_receive_window() returns -1, so the Linux box advertises a 64K window,
and things go downhill from there. My fix throws away the extra byte. This
also enabled me to delete another check a little further down in the code.

This change violates the RFC though, because the RFC (I think it's 1122,
or maybe it's 793) states that a TCP MUST always be able to accept one
additional byte of data. However I think this is a bug in the RFC, because
if a TCP conforms to this, then I can still send it a steady stream of
1-byte packets, even if it's advertizing a zero window, and the TCP must
accept and buffer this data.

I notice that tcp_receive_window() now checks for (win < 0), which also
fixes the above problem (at least the 64K window is not advertised).
However, it still means that the remote end can pass us data in 1-byte
packets, which is terrible for network performance.

Regrads,

-Vijay

--- linux-2.1.120/net/ipv4/tcp_input.c.orig Tue Sep 8 17:47:05 1998
+++ linux-2.1.120/net/ipv4/tcp_input.c Tue Sep 8 17:57:10 1998
@@ -292,7 +292,7 @@
/* The retransmission queue is always in order, so
* we can short-circuit the walk early.
*/
- if(!before(start_seq, TCP_SKB_CB(skb)->end_seq))
+ if(after(TCP_SKB_CB(skb)->end_seq, end_seq))
break;

/* We play conservative, we don't allow SACKS to partially
@@ -471,10 +471,9 @@
* to one half the current congestion window, but no less
* than two segments. Retransmit the missing segment.
*/
+ tp->dup_acks++;
if (tp->high_seq == 0 || after(ack, tp->high_seq)) {
- tp->dup_acks++;
if ((tp->fackets_out > 3) || (tp->dup_acks == 3)) {
- tp->dup_acks++;
tp->snd_ssthresh = max(tp->snd_cwnd >> (TCP_CWND_SHIFT + 1), 2);
tp->snd_cwnd = (tp->snd_ssthresh + 3) << TCP_CWND_SHIFT;
tp->high_seq = tp->snd_nxt;
@@ -1712,6 +1711,10 @@
*/

if (flg == tp->pred_flags && TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+ if (!tcp_sequence(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq)) {
+ tcp_send_ack(sk);
+ goto discard;
+ }
if (len <= th->doff*4) {
/* Bulk data transfer: sender */
if (len == th->doff*4) {
@@ -1726,13 +1729,8 @@
}
} else if (TCP_SKB_CB(skb)->ack_seq == tp->snd_una) {
/* Bulk data transfer: receiver */
- if (atomic_read(&sk->rmem_alloc) > sk->rcvbuf) {
- /* We must send an ACK for zero window probes. */
- if (!before(TCP_SKB_CB(skb)->seq,
- tp->rcv_wup + tp->rcv_wnd))
- tcp_send_ack(sk);
+ if (atomic_read(&sk->rmem_alloc) > sk->rcvbuf)
goto discard;
- }

__skb_pull(skb,th->doff*4);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/faq.html