RE: [PATCH] add slice by 8 algorithm to crc32.c

From: Joakim Tjernlund
Date: Fri Aug 05 2011 - 09:34:32 EST


Joakim Tjernlund/Transmode wrote on 2011/08/05 11:22:44:
>
> "Bob Pearson" <rpearson@xxxxxxxxxxxxxxxxxxxxx> wrote on 2011/08/04 20:53:20:
> >
> > Sure... See below.
> >
> > > -----Original Message-----
> > > From: Joakim Tjernlund [mailto:joakim.tjernlund@xxxxxxxxxxxx]
> > > Sent: Thursday, August 04, 2011 6:54 AM
> > > To: Bob Pearson
> > > Cc: 'Andrew Morton'; 'frank zago'; linux-kernel@xxxxxxxxxxxxxxx
> > > Subject: RE: [PATCH] add slice by 8 algorithm to crc32.c
> > >
> > > "Bob Pearson" <rpearson@xxxxxxxxxxxxxxxxxxxxx> wrote on 2011/08/02
> > > 23:14:39:
> > > >
> > > > Hi Joakim,
> > > >
> > > > Sorry to take so long to respond.
> > >
> > > No problem, but please insert your answers in the correct context (like
> > > I did). This makes it much easier to read and comment on.
> > >
> > > >
> > > > Here are some performance data collected from the original and modified
> > > > crc32 algorithms.
> > > > The following is a simple test loop that computes the time to compute
> > > > 1000 CRCs over 4096 bytes of data aligned on an 8-byte boundary after
> > > > warming the cache. You could make other measurements, but this is sort
> > > > of a best case.
> > > >
> > > > These measurements were made on a dual-socket Nehalem 2.267 GHz
> > > > system.
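
[ For reference, a userspace timing loop along those lines could look like
the sketch below. This is hypothetical, not Bob's actual harness; the
bit-at-a-time crc32_le() merely stands in for whichever routine is under
test. ]

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* stand-in for the routine under test (e.g. a userspace build of
 * lib/crc32.c): simple bit-at-a-time CRC32, little-endian convention */
static uint32_t crc32_le(uint32_t crc, const unsigned char *p, size_t len)
{
	int i;

	while (len--) {
		crc ^= *p++;
		for (i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
	}
	return crc;
}

int main(void)
{
	/* 4096 bytes on an 8 byte boundary, as described above */
	static unsigned char buf[4096] __attribute__((aligned(8)));
	struct timespec t1, t2;
	uint32_t crc = ~0U;
	long nsec;
	int i;

	memset(buf, 0x5a, sizeof(buf));
	crc = crc32_le(crc, buf, sizeof(buf));	/* warm the cache */

	clock_gettime(CLOCK_MONOTONIC, &t1);
	for (i = 0; i < 1000; i++)
		crc = crc32_le(crc, buf, sizeof(buf));
	clock_gettime(CLOCK_MONOTONIC, &t2);

	nsec = (t2.tv_sec - t1.tv_sec) * 1000000000L +
	       (t2.tv_nsec - t1.tv_nsec);
	printf("crc %08x: %ld nsec for 1000 x 4096 bytes\n", crc, nsec);
	return 0;
}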
> > >
> > > Measurements on your SPARC would be good too.
> >
> > Will do. But it is decrepit and quite slow. My main interest is running a
> > 10G protocol, so I am mostly motivated to get x86_64 going as fast as
> > possible.
>
> 64 bits may be faster on x86_64 but not on ppc32. Your latest patch gives:
> crc32: CRC_LE_BITS = 64, CRC_BE BITS = 64
> crc32: self tests passed, processed 225944 bytes in 3987640 nsec
> crc32: CRC_LE_BITS = 32, CRC_BE BITS = 32
> crc32: self tests passed, processed 225944 bytes in 2003630 nsec
> Almost a factor of 2 slower.
> So in any case I don't think 64 bits should be the default for all archs,
> probably only for 64-bit archs.
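
To make the comparison concrete, here is a sketch, as I read Bob's patch, of
the two little-endian inner loops being compared. The names and the table
construction are illustrative, not the kernel's; it assumes aligned input
whose length is a multiple of 4 resp. 8 bytes:

#include <stdint.h>
#include <stddef.h>

static uint32_t tab[8][256];	/* slice tables; tab[0] is the classic table */

/* build the slice tables for the (reflected) CRC32 polynomial */
static void crc32_build_tables(void)
{
	uint32_t crc;
	int i, j;

	for (i = 0; i < 256; i++) {
		for (crc = i, j = 0; j < 8; j++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
		tab[0][i] = crc;
	}
	for (j = 1; j < 8; j++)
		for (i = 0; i < 256; i++)
			tab[j][i] = (tab[j - 1][i] >> 8) ^
				    tab[0][tab[j - 1][i] & 255];
}

/* CRC_LE_BITS = 32: consume 4 bytes, 4 table lookups per iteration */
static uint32_t crc32_le_slice4(uint32_t crc, const uint32_t *p, size_t words)
{
	const uint32_t *t0 = tab[0], *t1 = tab[1],
		       *t2 = tab[2], *t3 = tab[3];

	while (words--) {
		uint32_t q = crc ^ *p++;

		crc = t3[q & 255] ^ t2[(q >> 8) & 255] ^
		      t1[(q >> 16) & 255] ^ t0[(q >> 24) & 255];
	}
	return crc;
}

/* CRC_LE_BITS = 64: consume 8 bytes, 8 lookups per iteration; more
 * parallelism for a wide x86_64 core, but more live values and more
 * table bases, which is what hurts on ppc32 */
static uint32_t crc32_le_slice8(uint32_t crc, const uint32_t *p, size_t dwords)
{
	const uint32_t *t0 = tab[0], *t1 = tab[1], *t2 = tab[2],
		       *t3 = tab[3], *t4 = tab[4], *t5 = tab[5],
		       *t6 = tab[6], *t7 = tab[7];

	while (dwords--) {
		uint32_t q1 = crc ^ *p++;
		uint32_t q2 = *p++;

		crc = t7[q1 & 255] ^ t6[(q1 >> 8) & 255] ^
		      t5[(q1 >> 16) & 255] ^ t4[(q1 >> 24) & 255] ^
		      t3[q2 & 255] ^ t2[(q2 >> 8) & 255] ^
		      t1[(q2 >> 16) & 255] ^ t0[(q2 >> 24) & 255];
	}
	return crc;
}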

I checked the asm on ppc for 32-bit crc32 and compared yours vs. mine. PPC
suffers with your version: the startup cost is much higher. I did notice one
win in your version though: the inner loop is reduced by 3 insns if one uses
separate arrays. However, loading 4 separate array bases costs 16 insns on
PPC, so I did the best thing for ppc:

diff --git a/lib/crc32.c b/lib/crc32.c
index 4855995..e3e391f 100644
--- a/lib/crc32.c
+++ b/lib/crc32.c
@@ -51,20 +51,21 @@ static inline u32
 crc32_body(u32 crc, unsigned char const *buf, size_t len, const u32 (*tab)[256])
 {
 # ifdef __LITTLE_ENDIAN
-#  define DO_CRC(x) crc = tab[0][(crc ^ (x)) & 255] ^ (crc >> 8)
-#  define DO_CRC4 crc = tab[3][(crc) & 255] ^ \
-		tab[2][(crc >> 8) & 255] ^ \
-		tab[1][(crc >> 16) & 255] ^ \
-		tab[0][(crc >> 24) & 255]
+#  define DO_CRC(x) crc = t0[(crc ^ (x)) & 255] ^ (crc >> 8)
+#  define DO_CRC4 crc = t3[(crc) & 255] ^ \
+		t2[(crc >> 8) & 255] ^ \
+		t1[(crc >> 16) & 255] ^ \
+		t0[(crc >> 24) & 255]
 # else
-#  define DO_CRC(x) crc = tab[0][((crc >> 24) ^ (x)) & 255] ^ (crc << 8)
-#  define DO_CRC4 crc = tab[0][(crc) & 255] ^ \
-		tab[1][(crc >> 8) & 255] ^ \
-		tab[2][(crc >> 16) & 255] ^ \
-		tab[3][(crc >> 24) & 255]
+#  define DO_CRC(x) crc = t0[((crc >> 24) ^ (x)) & 255] ^ (crc << 8)
+#  define DO_CRC4 crc = t0[(crc) & 255] ^ \
+		t1[(crc >> 8) & 255] ^ \
+		t2[(crc >> 16) & 255] ^ \
+		t3[(crc >> 24) & 255]
 # endif
 	const u32 *b;
 	size_t rem_len;
+	const u32 *t0=tab[0], *t1=t0 + 256, *t2=t1 + 256, *t3=t2 + 256;
 
 	/* Align it */
 	if (unlikely((long)buf & 3 && len)) {

This reduces the inner loop by 3 insns while adding only 5 insns of startup
cost. I hope this brings my crc32 (32 bits) in line with yours, even on
x86_64. Please test.
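
In plain C the effect is roughly the sketch below (illustrative names, not
lib/crc32.c verbatim). Since tab points at contiguous u32 [4][256] storage,
tab[1] equals tab[0] + 256, so the extra table bases are derived with one
add each instead of separate loads:

#include <stdint.h>
#include <stddef.h>

/* little-endian word loop of crc32_body after the change (sketch) */
uint32_t crc32_le_loop(uint32_t crc, const uint32_t *p, size_t words,
		       const uint32_t (*tab)[256])
{
	/* startup cost paid once: derive the four table bases and let
	 * the compiler keep them in registers */
	const uint32_t *t0 = tab[0], *t1 = t0 + 256,
		       *t2 = t1 + 256, *t3 = t2 + 256;

	while (words--) {
		crc ^= *p++;
		/* the loop body indexes off t0..t3 directly; the tab[n]
		 * addresses are no longer recomputed on every pass */
		crc = t3[crc & 255] ^ t2[(crc >> 8) & 255] ^
		      t1[(crc >> 16) & 255] ^ t0[(crc >> 24) & 255];
	}
	return crc;
}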

Jocke
