Re: IP Checksumming

Philippe Strauss (philou@sicel-home-1-4.urbanet.ch)
Fri, 22 Nov 1996 00:05:08 +0100 (MET)


I forward something of interest for you guys, that I've
found on comp.arch, it talks specifically of pentiums
optimizations.

-------------------------------------------------------

From: Terje Mathisen <Terje.Mathisen@hda.hydro.com>
Newsgroups: comp.lang.c,comp.arch
Subject: [++] TCPI/IP checksum (Was:Re: Plenty of Registers?)
Date: Wed, 06 Nov 1996 20:58:38 +0100
Organization: Hydro
Lines: 79
Message-ID: <3280EDEE.31DD@hda.hydro.com>
References: <548dra$sdk@bcrkh13.bnr.ca> <55agnr$qur@news1.mnsinc.com> <55hsl3$28h@linux.cs.Helsinki.FI> <55m4oq$7rt@news1.mnsinc.com> <1996Nov603.17.23.1378@koobera.math.uic.edu> <328085C8.12EC@ran.es>
NNTP-Posting-Host: 136.164.13.18
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 3.0Gold (Win95; I)
Xref: span.ch comp.lang.c:193371 comp.arch:67573

Colin Dooley wrote:
>
> D. J. Bernstein wrote:
> >
> > Szu-Wen Huang <huang@mnsinc.com> wrote:
> > > How much effort would be required
> > > in micro-optimizations to achieve doubled speed performance in (let's say)
> > > the TCP/IP stack?
> >
> > Funny you mention that.
> >
> > One of the biggest time-sinks in a TCP/IP stack is the TCP/IP checksum.
> > The TCP/IP checksum was popularized a while back by Terje Mathisen and
> > Michael Abrash as a beautiful target for Pentium optimization.
> >
> > You have to handle 4 bytes at a time, not 2; watch out for 4-byte
> > alignment (can't be done in C without unportable pointer-to-int casts),
> > use ADC (which current C compilers don't provide to the programmer), and
> > schedule well.
> >
> > Terje's code runs _ten times faster_ than straightforward C code.
> >
>
> But are you saying that it would be impossible to write a decent
> version in C, or just that this particular version could be made
> 10 times faster (and only on a Pentium)?

The TCP/IP checksum is a very beautiful algorithm, mainly because it is
bi-endian:

Split a packet into 16-bit blocks and calculate the 16-bit checksum,
while wrapping all carries back into the sum.

There are many ways to get decent speed for this algorithm, but the last
factor of two really requires a carry flag that the programmer can use.

I.e. the fastest possible C code can process 16 bits in each iteration,
if running on a 32-bit cpu, 32 bits on a 64-bit cpu.

If you have conditionally executed opcodes, or a conditional move, then
you can gain back the full register width, i.e. you can do something
like this efficiently:

unsigned long data[];
...
unsigned long sum, new_sum;

for (i = sum = 0; i < len; i++) {
new_sum = sum + data[i];
if (new_sum < sum) /* Did we get carry/overflow from the add? */
new_sum++;
sum = new_sum;
}

If you have to do the carry propagation with compare & branch, then it
will probably be faster to process half the number of bits/iteration:

unsigned short data[];
...
unsigned long sum, new_sum;

for (i = sum = 0; i < len; i += max) {
max = (len - i < 65536l)? len : i+65535l; /* Make sure that we won't
overflow! */
for ( ; i < max; i++) {
sum += data[i];
}
sum = (sum & 0xffff) + (sum >> 16);
}
sum = (sum & 0xffff) + (sum >> 16);

If no packet can be greater than 128KB, then the double loop above can
be shortened to just one.

-- 
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

---------------------------------------------------------------

-- 
Philippe Strauss, CH-1092 Belmont

Email: <philippe.strauss@urbanet.ch> Homepage: http://sicel-home-1-4.urbanet.ch