Re: OFFTOPIC: Regarding NT vs Linux

David S. Miller (davem@jenolan.rutgers.edu)
Wed, 24 Sep 1997 20:39:51 -0400


Date: Wed, 24 Sep 1997 03:46:38 -0700 (PDT)
From: Dean Gaudet <dgaudet-list-linux-kernel@arctic.org>

Interesting ... but does it require you to use those non-portable
constructs? Or will a more traditional mmap() approach do the job
as well?

Solaris 2.6 w/Sun's ATM card also does zero-copy TCP. I believe
its main requirement is that you write() a multiple of 16k from a
mmap()d file. It unfortunately won't do it if you use writev().
(Think http headers, and the first chunk of the file ... apache 1.3
does this.)

I think people should concentrate on the transmit case, since it is
the simplest. I've looked at it, and it really is trivial. The thing
you have to keep in mind is not to try too hard to perform the
optimization, because for most of the cases that matter you easily hit
the constraints.

Here is my picture of a copy-on-write transmit path.

Each SKB has two modes of operation: a "traditional" mode (i.e. how
they work right now) and a "sharing" mode where you roughly have a ptr
to the protocol header in kernel space and pointer(s) to the user
buffer data that gets slapped at the end.
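
A minimal sketch of what such a two-mode buffer could look like. Every
name here (sketch_skb, skb_frag, SKETCH_MAX_FRAGS, and so on) is
hypothetical and made up for illustration; this is not the kernel's
struct sk_buff, just one way to lay out the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit on user-data pieces per frame (an assumption). */
#define SKETCH_MAX_FRAGS 4

enum skb_mode { SKB_TRADITIONAL, SKB_SHARING };

/* One ptr/len piece of user data, iovec style. */
struct skb_frag {
    const void *base;
    size_t len;
};

/* Hypothetical two-mode buffer; NOT the real struct sk_buff. */
struct sketch_skb {
    enum skb_mode mode;
    unsigned char *hdr;      /* protocol headers, always kernel memory */
    size_t hdr_len;
    union {
        struct {             /* traditional: payload copied into kernel */
            unsigned char *data;
            size_t len;
        } copied;
        struct {             /* sharing: payload stays in user pages */
            struct skb_frag frag[SKETCH_MAX_FRAGS];
            int nfrags;
        } shared;
    } u;
};

/* Payload length, whichever mode the buffer is in. */
static size_t sketch_skb_payload_len(const struct sketch_skb *skb)
{
    size_t total = 0;

    assert(skb != NULL);
    if (skb->mode == SKB_TRADITIONAL)
        return skb->u.copied.len;
    for (int i = 0; i < skb->u.shared.nfrags; i++)
        total += skb->u.shared.frag[i].len;
    return total;
}
```

The header lives in kernel memory in both modes; only the payload
differs, which is why the rest of the stack can mostly ignore the mode.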

Consider that the machines where you'd actually gain from this have
network cards which can handle more than one iovec-type entry for a
single transmission, and optionally might even checksum the thing for
you on the way out. So this can work.

Once the send path decides it is about to send some bytes onto the
wire it goes:

1) Determine route for destination (we usually have this handy already
at this point).

2) From that get the device it will go out on.

3) See if device can handle more than one buffer pointer for a single
transmitted frame.

4) If it can:
   a) Use the usual TCP algorithms to determine how much we should
      send.
   b) Based upon that, see if the number of page crossings does
      not exceed the number of buffer pieces the device can
      handle minus one (the minus one is for the header, which is
      why the device must support at least two to get here).
   c) If it does exceed that, go to (5).
   d) Mark all the relevant user pages copy-on-write; this
      pins them down.
   e) Allocate a new style SKB, with a small kernel buffer attached
      for building the headers.
   f) Attach the iovec of ptr/len elements to the user data into
      the new style SKB.
   g) Build the TCP header. If the device can do the outgoing
      checksum then indicate in the SKB where the checksum should
      go, else checksum the packet.
   h) Send it off.

5) If it can't, get a traditional SKB and copy all the data and build
the header in there.
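
The choice between the two paths could be sketched roughly as below.
Assumptions, all mine: a 4096-byte page, every page the data touches
needs its own buffer pointer (pages are not assumed physically
contiguous), and the names SKETCH_PAGE_SIZE, data_pieces_needed and
can_share are hypothetical. Note the sketch counts data pieces (page
crossings plus one) and keeps one device slot back for the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */

/* Number of pages, hence data pieces, that [addr, addr+len) touches. */
static size_t data_pieces_needed(uintptr_t addr, size_t len)
{
    uintptr_t first, last;

    if (len == 0)
        return 0;
    first = addr / SKETCH_PAGE_SIZE;
    last = (addr + len - 1) / SKETCH_PAGE_SIZE;
    return (size_t)(last - first + 1);
}

/*
 * Returns 1 if the sharing (zero-copy) path can be used, 0 if we must
 * fall back to a traditional SKB and copy.  dev_max_segs is the number
 * of buffer pointers the device accepts for one transmitted frame.
 */
static int can_share(uintptr_t addr, size_t len, unsigned dev_max_segs)
{
    assert(len > 0);
    if (dev_max_segs < 2)        /* need header + at least one piece */
        return 0;
    return data_pieces_needed(addr, len) + 1 <= dev_max_segs;
}
```

For example, a 1460-byte segment starting at offset 4000 straddles two
pages, so under these assumptions it needs a device that takes at least
three pointers per frame; with only two it falls through to the copy.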

On ack of the transmission:

1) If it was a new style SKB, unpin all the user pages, perform any
needed wakeups (not normal protocol ones, ones resulting from
faults on the pinned pages if any).

2) Normal processing, try to transmit more data if any.
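
A rough sketch of the unpin step on ack. The per-page pin count is my
own assumption (one page may back several in-flight segments), and
sketch_page, sketch_pin, sketch_unpin and sketch_ack_shared are
hypothetical names, not existing kernel interfaces. Wakeups are modeled
as a plain counter rather than a real wait queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page bookkeeping for the sharing mode. */
struct sketch_page {
    unsigned pin_count;  /* in-flight SKBs still pointing at this page */
    int cow;             /* nonzero while marked copy-on-write */
    int waiters;         /* tasks that faulted on the page and sleep */
};

/* Called when a segment referencing the page is queued for transmit. */
static void sketch_pin(struct sketch_page *p)
{
    p->pin_count++;
    p->cow = 1;
}

/* Drop one pin; returns how many sleepers should be woken (0 if the
 * page is still pinned by another in-flight segment). */
static int sketch_unpin(struct sketch_page *p)
{
    int woken = 0;

    assert(p->pin_count > 0);
    if (--p->pin_count == 0) {
        p->cow = 0;
        woken = p->waiters;
        p->waiters = 0;
    }
    return woken;
}

/* On ack of a sharing-mode SKB: unpin every page it referenced. */
static int sketch_ack_shared(struct sketch_page **pages, size_t npages)
{
    int woken = 0;

    for (size_t i = 0; i < npages; i++)
        woken += sketch_unpin(pages[i]);
    return woken;
}
```

The wakeups only happen when the last pin goes away, so a page shared
by two unacked segments stays copy-on-write until both are acked.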

It is even a win without hardware checksums. Even if the performance
is the same, you're saving memory. But I think it will be faster as
well, because in many cases the user's copy will be in the L2 cache, so
the checksum pass will be taking cache hits.
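
For the software checksum case, the pass over the user data could look
something like this: a plain 16-bit one's-complement Internet checksum
(as in RFC 1071) walked across the ptr/len pieces in order. The names
csum_frag and csum_frags are hypothetical, and it goes byte by byte for
clarity rather than speed, so take it as a sketch and not as how the
stack does or should do it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One ptr/len piece of the data to be summed (hypothetical type). */
struct csum_frag {
    const uint8_t *base;
    size_t len;
};

/*
 * Internet checksum over a list of pieces, treated as one contiguous
 * byte stream.  'odd' carries the byte position across piece
 * boundaries, so odd-length pieces are handled correctly.
 */
static uint16_t csum_frags(const struct csum_frag *frag, int nfrags)
{
    uint64_t sum = 0;
    int odd = 0;   /* 1 when the next byte is the low half of a word */

    assert(nfrags >= 0);
    for (int i = 0; i < nfrags; i++) {
        for (size_t j = 0; j < frag[i].len; j++) {
            uint64_t b = frag[i].base[j];

            sum += odd ? b : (b << 8);
            odd ^= 1;
        }
    }
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Splitting the same bytes into different pieces gives the same result,
which is the property the sharing mode needs when the payload is
scattered over several user pages.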

Later,
David "Sparc" Miller
davem@caip.rutgers.edu