RE: [patch v3 7/7] crc32: final-cleanup.diff

From: Bob Pearson
Date: Tue Aug 09 2011 - 19:06:05 EST




>
> Doing this adds one insn to ppc. What is good for x86 isn't for
> ppc32 (and I guess any modern arch).

I looked at the assembly output from the original code and I see two
compares at the end of the first loop: one for --len and one for (buf & 3),
which it has to compute. You are better off computing buf & 3 once and
comparing it with len once to get init_len; then each iteration of the
loop only has to test one thing.

As proposed:
init_len = min((-((uintptr_t)buf)) & 3, len);
...

/* process unaligned initial bytes */
for (i = 0; i < init_len; i++)
DO_CRC(*buf++);


crc32x_le:
<.....>
negq %rcx # (-buf)
andl $3, %ecx # (-buf) & 3
cmpq %rdx, %rcx # min()
cmova %rdx, %rcx # init_len =
xorl %edi, %edi # i = 0
<.....>
jmp .L2
.L3:
# crc = tab[0][(crc ^ (*buf++)) & 255] ^ (crc >> 8)
movb (%rsi,%rdi), %dl # buf[i]
incq %rdi # buf++
xorl %eax, %edx # crc ^ *buf++
shrl $8, %eax # crc >> 8
movzbl %dl, %edx # & 255
xorl crc32table_le(,%rdx,4), %eax # crc = tab[...] ^ (crc >> 8)
.L2:
cmpq %rcx, %rdi # compare i with init_len
jb .L3
<.....>

As was originally:
if (unlikely((long)buf & 3 && len)) {
do {
DO_CRC(*buf++);
} while ((--len) && ((long)buf)&3);
}

crc32x_le:
pushq %rbp
movl %edi, %eax
pushq %rbx
testb $3, %sil
je .L16
testq %rdx, %rdx
.L25:
je .L16
movb (%rsi), %cl # *p
incq %rsi # p++
xorl %eax, %ecx # crc ^ *p
shrl $8, %eax # crc >> 8
movzbl %cl, %ecx # ... & 255
xorl crc32table_le(,%rcx,4), %eax # tab[...] ^ (crc >> 8)
decq %rdx # len--
je .L16 # if len != 0 continue
testb $3, %sil # buf & 3
jmp .L25 # if buf & 3 continue
.L16:

The first loop has 8 instructions; the second has 11. The first loop has
slightly more setup to compute init_len; the second has a couple of extra
branches for the if() test outside the loop.

I get better performance with the new form of the loop. How does PPC get
better results with two tests?

>
> > >
> > > -/* implements slicing-by-4 or slicing-by-8 algorithm */
> > > -static inline u32
> > > -crc32_body(u32 crc, unsigned char const *buf, size_t len, const u32
> > > (*tab)[256])
> >
> > After careful measurements and looking at asm code I figured out that
> there
> > was no penalty for using 2D array. That allowed me to go back to the
> > original form.
>
> What about my suggestion to assign each table to a ptr? This was good
> for ppc so if it doesn't slow x86 it should be added.

Gcc is very clever: it inlined crc32_body() into crc32_le() and crc32_be()
and then replaced references to tab[x][y] with references to crc32table_le
and crc32table_be plus a constant offset that depends on the first index
([x]). So in effect it flattened everything back to 1-D arrays as
(crc32table_le + offset)[y], which is equivalent to the t0_le[y] code I had
before.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/