Re: Proposal: merged system calls

Tom May (ftom@netcom.com)
Sun, 26 May 1996 12:11:20 -0700


David S. Miller writes:

> From: Alan Cox <alan@cymru.net>
> Date: Mon, 20 May 1996 09:56:18 +0100 (BST)
>
> > - isnt it mmap that should be used to implement zero-copy
>
> The net code folds copy and checksum so the user->kernel copy is very close
> to free (it is free for most people unless there is a lot of bus activity)
>
>(Those who don't feel like having a quick lesson in Sparc assembly
>optimization skip to end to see why this is so relevant anyways.)
>
>It is more than free on the Sparc I have found with 1000 hit/sec
>detailed to the instruction profiling information sampled during a 2gb
>TCP transfer. In cases where the memcpy() code would completely stall
>(and thus clear out the entire pipeline) the csum/copy code is filling
>the stalls in with "useful" work, this is especially true with chips
>which lack a store buffer or worse lack write-allocate on the cache.

You're looking at this the wrong way. It is *always* necessary to
compute the checksum. We are trying to decide whether it would be a
win to avoid the copy. So the question is whether we can fold the
copy into the checksum without slowing the checksum down, not the
other way around. I tested the speed of csum_partial()
vs. csum_partial_copy() and csum_partial_copy_fromuser() on my systems
with the included program and got the following results (user-time
clock ticks for 300000 passes over a 1024-byte buffer; smaller number
means faster):

                          486/66DX2   Pentium 120MHz overdrive
1) csum_partial:                342          89
2) c_p_copy:                   1018        1310
3) c_p_c_fromuser:             1021        1317
4) memcpy:                      978        1309
5) memcpy + c_p(dst):          2109        1783
6) memcpy + c_p(src):          2110        1394
7) c_p(src) + memcpy:          2105        1398

(Yes, my Pentium system sucks rocks on writes. Also, it beats the
hell out of me why the times for rows 1 and 4 add up to much less
than the time for row 5 on the Pentium and rows 5-7 on the 486).

csum+copy is about the same speed as memcpy, so we are getting the csum
(nearly) for free. But csum_partial() alone is much faster than the
copy+csum functions (about 3x on the 486 and nearly 15x on the
Pentium), so avoiding the copy still looks like a win.
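
For anyone who wants to see what "folding the copy into the checksum"
means without reading the i386 assembly, here is a minimal portable
sketch. It is not the code from arch/i386/lib/checksum.c: the names
sketch_csum(), sketch_csum_copy() and fold16() are made up for
illustration, and it returns a folded 16-bit ones'-complement sum
(not complemented) rather than whatever the kernel routines return.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fold a wide accumulator down to 16 bits,
   adding the carries back in (ones'-complement addition). */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Checksum only: one pass of loads over the buffer, no stores. */
uint16_t sketch_csum(const unsigned char *src, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    if (i < len)                    /* odd trailing byte, zero-padded */
        sum += (uint32_t)src[i] << 8;
    return fold16(sum);
}

/* Checksum and copy folded together: the same loads feed both the
   running sum and the stores to dst, so the data is read only once. */
uint16_t sketch_csum_copy(const unsigned char *src, unsigned char *dst,
                          size_t len)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
    }
    if (i < len) {                  /* odd trailing byte, zero-padded */
        sum += (uint32_t)src[i] << 8;
        dst[i] = src[i];
    }
    return fold16(sum);
}
```

The only difference between the two loops is the stores. Rows 1 and 2
of the table measure that difference for the real routines.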

Tom.

/* gcc -O2 -fomit-frame-pointer -o spud spud.c */

#include <stdio.h>	/* printf */
#include <string.h>	/* memcpy */
#include <sys/times.h>

#include <asm/checksum.h>
#include "checksum.c" /* arch/i386/lib/checksum.c with #includes removed */

#define SIZE 1024

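struct {
	long src[1024][SIZE/4];
	long fill[100];		/* gap between the two arrays */
	long dst[1024][SIZE/4];
} S;

/* Walk through 1024 distinct SIZE-byte buffers, wrapping around. */
#define SRC(n) ((char *)&S.src[(n)&1023])
#define DST(n) ((char *)&S.dst[(n)&1023])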

int
main (void)
{
	struct tms start, stop;
	int i;

	times (&start);

	/* Exactly one branch is compiled in; change a 0 to 1 to select a row. */
	for (i = 0; i < 300000; i++) {
#if 0	/* row 1 */
		csum_partial (DST(i), SIZE, 0);
#elif 0	/* row 2 */
		csum_partial_copy (SRC(i), DST(i), SIZE, 0);
#elif 0	/* row 3 */
		csum_partial_copy_fromuser (SRC(i), DST(i), SIZE, 0);
#elif 0	/* row 4 */
		memcpy (DST(i), SRC(i), SIZE);
#elif 0	/* row 5 */
		memcpy (DST(i), SRC(i), SIZE);
		csum_partial (DST(i), SIZE, 0);
#elif 0	/* row 6 */
		memcpy (DST(i), SRC(i), SIZE);
		csum_partial (SRC(i), SIZE, 0);
#else	/* row 7 */
		csum_partial (SRC(i), SIZE, 0);
		memcpy (DST(i), SRC(i), SIZE);
#endif
	}

	times (&stop);

	/* User CPU time consumed by the loop, in clock ticks. */
	printf ("%ld\n", (long) (stop.tms_utime - start.tms_utime));
	return 0;
}
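
The program prints a raw tms_utime difference, which is in clock
ticks. If you want seconds instead, something along these lines should
do on a POSIX system. ticks_to_seconds() is a made-up helper name, and
the tick rate is whatever sysconf() reports on the machine running it;
I am not assuming any particular value.

```c
#include <sys/times.h>
#include <unistd.h>

/* Hypothetical helper: convert a tms_utime difference (clock ticks)
   to seconds. Returns -1.0 if the tick rate cannot be determined. */
double ticks_to_seconds(clock_t ticks)
{
    long hz = sysconf(_SC_CLK_TCK);   /* clock ticks per second */

    if (hz <= 0)
        return -1.0;
    return (double)ticks / (double)hz;
}
```

Dividing that by the iteration count (300000 in the program above)
gives the time per 1024-byte buffer.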