Re: IP Checksumming

Tom May (ftom@netcom.com)
20 Nov 1996 20:08:18 -0800


"Richard B. Johnson" <root@analogic.com> writes:

> I can't figure out the %)&^$ GNU pseudo assembly that thinks that
> all assembly is written like a 68k (source->dest). However, using
> Intel mnemonics, the following checksums an IP packet the fastest.
> It can also copy while checksumming.
>
> It minimizes the number of jumps, which flush prefetch buffers and
> mess up the speed.

What are you talking about? Your code does one jump per word. The
existing code does one jump per 16 words.

> I have watched Linux use different checksum methods throughout the
> years and it seems that things are getting worse.

They're not. I've been responsible for a number of the recent changes
and I can assure you that I always run a number of tests with various
sizes, alignments, and caching on a 486/66 and my crappy Pentium
system. It has been getting faster.

> If someone could convert this to the GNU stuff without breaking the
> logic, I know it would perform MUCH better than the present kernel
> IP checksums even though it uses words for memory access. The present
> checksum routines try several tricks.

The tricks are loop unrolling and minimizing Pentium pipeline stalls.
They work.

> However, every time a branch
> on compare occurs, there is an enormous penalty because the prefetch
> buffer must be flushed.

Not true with Pentium branch prediction. I know how the Pentium
algorithm works and it will predict our branches correctly in almost
all cases. The bad thing about the branch instruction on a Pentium is
that it must be executed at all.

> Further, the present routines in ../../asm
> don't take advantage of the Intel architecture.

You mean the lodsw and loop instructions? Those have been losers
since the 386.

> I am sure that if someone takes the time to convert this stuff, they
> will be very pleasantly surprised.
>
> ;
> ;
> ; inst dest, source
> ;
> mov esi,offset source_addr ; Get location of string
> IFDEF COPY_WHILE_CHKSUM
> mov edi,offset dest_addr ; Where to copy
> ENDIF
> mov ecx,string_length ; Get string word length
> xor eax,eax ; Zero register, clear CY
> mov edx,eax ; Zero register
> cld ; Forwards
> ;
> l0: lodsw ; Get word, source_addr++
> IFDEF COPY_WHILE_CHKSUM
> stosw ; Put word dest_addr++
> ENDIF
> adc edx,eax ; Sum to accumulator + CY
> loop l0
> ;
> adc edx,0 ; Possible last carry
> IFDEF GONNA_SUM_INTO_WHAT_YOU_CHECKSUMMED
> not edx ; Invert all bits
> ENDIF
> mov eax,edx ; To return in eax

Just to prove that the existing stuff is pretty damn good (although
not optimal, I've submitted a patch that hasn't been applied yet and
I've recently got some more stuff to look at), I converted your code
and timed it against the kernel code on a 486. These are *times*, so
smaller is better:

Your code: 3259
Kernel code: 328

Your code is an order of magnitude slower. Expect results on a
Pentium to be more spectacularly different.

Here's your converted routine. You're welcome to put it in your own
test harness and time it against the kernel code yourself. Note that
it breaks if len is not a multiple of 2.

unsigned int csum_partial(const unsigned char *buff, int len, unsigned int sum)
{
	len /= 2;			/* word count */
	__asm__(
		"xorl %%eax,%%eax\n"	/* Zero eax, clear CY (lodsw only writes ax) */
		"0:\n"
		"lodsw\n"		/* Get word, source_addr++ */
		"adcl %%eax,%%edx\n"	/* Sum to accumulator + CY */
		"loop 0b\n"
		"adcl $0,%%edx"		/* Possible last carry */
		: "=d" (sum), "+c" (len), "+S" (buff)
		: "0" (sum)
		: "eax", "cc", "memory");
	return sum;
}

Tom.