[Changelog] - Potential performance bottleneck for Linux TCP

From: Wenji Wu
Date: Wed Nov 29 2006 - 18:28:22 EST



From: Wenji Wu <wenji@xxxxxxxx>

Greetings,

For Linux TCP, when a network application makes a system call to move data
from the socket's receive buffer to user space, tcp_recvmsg() is called and
the socket is locked. During that period, all incoming packets for the TCP
socket go to the backlog queue without being TCP-processed. Since Linux 2.6
can interrupt a task mid-execution, if the network application's timeslice
expires and the process is moved to the expired array with the socket still
locked, the packets in the backlog queue will not be TCP-processed until the
network application resumes execution. If the system is heavily loaded, TCP
can easily RTO on the sender side.

Attached is the Changelog for the patch

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@xxxxxxxx
(O): 001-630-840-4541

From: Wenji Wu <wenji@xxxxxxxx>

- Subject

Potential performance bottleneck for Linux TCP (2.6 Desktop, Low-latency Desktop)


- Why the kernel needed patching

For Linux TCP, when a network application makes a system call to move data
from the socket's receive buffer to user space, tcp_recvmsg() is called and
the socket is locked. During that period, all incoming packets for the TCP
socket go to the backlog queue without being TCP-processed. Since Linux 2.6
can interrupt a task mid-execution, if the network application's timeslice
expires and the process is moved to the expired array with the socket still
locked, the packets in the backlog queue will not be TCP-processed until the
network application resumes execution. If the system is heavily loaded, TCP
can easily RTO on the sender side.
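To make the bottleneck concrete, here is a minimal user-space model of the receive-path decision described above. The struct and function names are illustrative stand-ins (the real logic lives in the kernel's receive path and release_sock()): while the receiving process holds the socket lock, arriving packets are parked on the backlog unprocessed, and they are only drained once that process runs again and releases the socket.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified model; not the real kernel structures. */
struct packet { int id; struct packet *next; };

struct sock_model {
    int owned_by_user;      /* set while tcp_recvmsg() holds the socket lock */
    int processed;          /* count of packets that got TCP processing */
    struct packet *backlog; /* packets deferred until the socket is released */
};

/* On packet arrival: if the receiving process holds the socket lock,
 * the packet is parked on the backlog queue without TCP processing
 * (so no ACK is generated for it). */
static void rcv_packet(struct sock_model *sk, struct packet *p)
{
    if (sk->owned_by_user) {
        p->next = sk->backlog;
        sk->backlog = p;      /* deferred: no TCP processing yet */
    } else {
        sk->processed++;      /* normal immediate TCP processing */
    }
}

/* The backlog is drained only when the owning process runs again and
 * releases the socket -- which is exactly what is delayed if that
 * process sits in the expired array. */
static void release_sock_model(struct sock_model *sk)
{
    while (sk->backlog) {
        sk->backlog = sk->backlog->next;
        sk->processed++;
    }
    sk->owned_by_user = 0;
}
```

If the owning process is descheduled for a long time between locking and releasing, the sender sees no ACKs for the backlogged packets and can hit an RTO, which is the bottleneck this patch targets.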

- The overall design approach in the patch

The underlying idea is that when there are packets waiting on the prequeue
or backlog queue, the data-receiving process should not be allowed to release
the CPU for long.

- Implementation details

We have modified the Linux process scheduling policy and tcp_recvmsg().

To summarize, the solution works as follows:

An expired data-receiving process with packets waiting on the backlog queue or
prequeue is moved to the active array, instead of to the expired array as usual.
More often than not, the expired data-receiving process will continue to run.
Even if it doesn't, the wait time before it resumes execution will be greatly
reduced. However, this gives the process extra runs compared to the other
processes in the runqueue.

For the sake of fairness, the process is labeled with the extra_run_flag.
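The modified expiry decision can be sketched as follows. This is a hedged user-space model, not the patch itself: only extra_run_flag comes from the changelog, and the other field and function names are illustrative stand-ins for the O(1) scheduler's active/expired bookkeeping.

```c
#include <assert.h>

/* Hypothetical model of the modified timeslice-expiry decision. */
struct task_model {
    int has_pending_tcp;  /* packets waiting on backlog queue or prequeue */
    int extra_run_flag;   /* marks the extra run, paid back later via yield() */
    int in_active_array;  /* 1 = active array, 0 = expired array */
};

/* When a process's timeslice expires: a data-receiving process with
 * pending TCP packets is requeued into the active array and flagged;
 * any other process takes the usual path to the expired array. */
static void on_timeslice_expiry(struct task_model *t)
{
    if (t->has_pending_tcp) {
        t->in_active_array = 1;
        t->extra_run_flag = 1;
    } else {
        t->in_active_array = 0;
    }
}
```

The flag records the unfairness so it can be repaid when tcp_recvmsg() finishes draining the queues.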

Also considering the facts that:

(1) the resumed process will continue its execution within tcp_recvmsg();
(2) tcp_recvmsg() does not return to user space until the prequeue and backlog queue are drained.

For the sake of fairness, we modified tcp_recvmsg() as follows: after the
prequeue and backlog queue are drained and before tcp_recvmsg() returns to
user space, any process labeled with the extra_run_flag calls yield() to
explicitly yield the CPU to the other processes in the runqueue. yield()
works by removing the process from the active array (where it currently is,
because it is running) and inserting it into the expired array.

Also, to prevent processes in the expired array from starving, a special rule
has been added to Linux process scheduling (the same rule used for interactive
processes): an expiring process is moved to the expired array regardless of
its status if processes in the expired array are starved.
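The exit-path change in tcp_recvmsg() can be sketched like this. Again a hedged user-space model: extra_run_flag is the patch's flag, while the struct, function name, and the yielded field are illustrative stand-ins for the real task state and yield() call.

```c
#include <assert.h>

/* Hypothetical model of the tcp_recvmsg() exit-path change. */
struct recv_task {
    int extra_run_flag; /* set when the process got an extra run */
    int yielded;        /* stand-in for having called yield() */
};

/* After the prequeue and backlog queue are drained and before returning
 * to user space, a flagged process pays back its extra run by yielding;
 * yield() moves it from the active array into the expired array. */
static void tcp_recvmsg_exit(struct recv_task *t)
{
    if (t->extra_run_flag) {
        t->extra_run_flag = 0;
        t->yielded = 1;   /* real code would call yield() here */
    }
}
```

An unflagged process returns to user space without yielding, so the fairness cost is paid only by processes that actually received an extra run.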

Changed files:

/kernel/sched.c
/kernel/fork.c
/include/linux/sched.h
/net/ipv4/tcp.c

- Testing results

The proposed solution trades off a small amount of fairness to resolve the TCP
performance bottleneck. It does not cause a serious fairness issue.

The patch is for Linux kernel 2.6.14, Desktop and Low-latency Desktop.