Re: [patch v3 6/7] crc32: add-slicing-by-8.diff

From: George Spelvin
Date: Tue Aug 09 2011 - 07:21:27 EST


While writing up some documentation of this algorithm, I came up with
a potential speedup. Or, at least, realized why slicing by more than
4 is so much faster than slicing by 4 or less.

Note that the inner loop of the algorithm is as follows:

+# define DO_CRC8a (tab[7][(q) & 255] ^ \
+ tab[6][(q >> 8) & 255] ^ \
+ tab[5][(q >> 16) & 255] ^ \
+ tab[4][(q >> 24) & 255])
+# define DO_CRC8b (tab[3][(q) & 255] ^ \
+ tab[2][(q >> 8) & 255] ^ \
+ tab[1][(q >> 16) & 255] ^ \
+ tab[0][(q >> 24) & 255])

+ for (--b; middle_len; --middle_len) {
+ u32 q;
+ q = crc ^ *++b;
+ crc = DO_CRC8a;
+ q = *++b;
+ crc ^= DO_CRC8b;
}

Note the data dependencies: DO_CRC8a depends on the
previous crc, which depends on the previous DO_CRC8b.
But DO_CRC8b does not depend on anything except the
input data at *++b.

It would increase parallelism to schedule DO_CRC8b before DO_CRC8a,
to start those loads before the previous crc value is available.

Maybe the compiler and/pr processor can find this parallelism already,
but if not, it might be useful to try reordering it:

# define DO_CRC8a(x) (tab[7][(x) & 255] ^ \
tab[6][((x) >> 8) & 255] ^ \
tab[5][((x) >> 16) & 255] ^ \
tab[4][((x) >> 24) & 255])
# define DO_CRC8b(x) (tab[3][(x) & 255] ^ \
tab[2][((x) >> 8) & 255] ^ \
tab[1][((x) >> 16) & 255] ^ \
tab[0][((x) >> 24) & 255])

for ( ; middle_len; --middle_len, b += 2) {
u32 q = DO_CRC8b(b[1]);
crc ^= b[0];
crc = q ^ DO_CRC8a(crc);
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/