RE: [PATCH] riscv: Optimize crc32 with Zbc extension

From: Wang, Xiao W
Date: Wed Jan 24 2024 - 02:22:07 EST


Hi David,

> -----Original Message-----
> From: David Laight <David.Laight@xxxxxxxxxx>
> Sent: Wednesday, January 17, 2024 1:16 AM
> To: 'Andrew Jones' <ajones@xxxxxxxxxxxxxxxx>; Wang, Xiao W
> <xiao.w.wang@xxxxxxxxx>
> Cc: paul.walmsley@xxxxxxxxxx; palmer@xxxxxxxxxxx;
> aou@xxxxxxxxxxxxxxxxx; conor.dooley@xxxxxxxxxxxxx; heiko@xxxxxxxxx; Li,
> Haicheng <haicheng.li@xxxxxxxxx>; linux-riscv@xxxxxxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx
> Subject: RE: [PATCH] riscv: Optimize crc32 with Zbc extension
>
> ...
> > > +static inline u32 __pure crc32_le_generic(u32 crc, unsigned char const *p,
> > > +#if (BITS_PER_LONG == 64)
> > > + size_t len, u32 poly, u64 poly_qt,
> > > +#else
> > > + size_t len, u32 poly, u32 poly_qt,
> > > +#endif
> >
> > How about creating a new type for poly_qt, defined as u64 for xlen=64
> > and u32 for xlen=32 to avoid the #ifdef?
>
> unsigned long ?

Yes, it's simpler. Thanks.
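
Just for illustration, with unsigned long the #if around the parameter
list goes away. A rough sketch (parameters taken from the quoted
snippet; the actual prototype in the next revision may differ):

/* unsigned long is 64-bit on rv64 and 32-bit on rv32, so a single
 * declaration covers both cases without the #if.
 */
static inline u32 __pure crc32_le_generic(u32 crc, unsigned char const *p,
					  size_t len, u32 poly,
					  unsigned long poly_qt);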

>
> ...
> > > + for (int i = 0; i < len; i++) {
> > > +#if (BITS_PER_LONG == 64)
> > > + s = (unsigned long)crc << 32;
> > > + s ^= __cpu_to_be64(*p_ul++);
> > > +#else
> > > + s = crc ^ __cpu_to_be32(*p_ul++);
> > > +#endif
> >
> > Could write the above without #ifdef with
>
> Haven't I seen a bpf patch that rather implies that byteswap
> is likely to be truly horrid?
>
> I've not tried to parse the crc code (although I do understand
> how it should work). But I'm surprised you need a byteswap.
>
> After all, the crc is basically a long division of the buffer
> by the crc constant.

For the *_be mode, the API expects that within each byte from the data
stream, the higher order poly bit sits at a more significant bit position,
and that within the parameter "u32 crc" the higher order poly bit likewise
sits at a more significant bit position, regardless of CPU endianness.

After loading an unsigned long of data from memory into a register, the
"higher order poly bit sits at a more significant bit position" property still
holds within each byte, but not across the whole register on a little-endian
CPU. In order to do the XOR in s ^= __cpu_to_be64(*p_ul++), and to run the
CLMUL insn at XLEN scale, we need to rearrange the bytes so the whole
register is correctly ordered.
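
As a rough sketch of that rearrangement on a 64-bit little-endian CPU
(s and p_ul as in the quoted code, the CLMUL step itself omitted):

/* Bytes b0 b1 ... b7 from the data stream are loaded as 0xb7b6...b1b0
 * into the register on a little-endian CPU; __cpu_to_be64() restores
 * the stream byte order across the whole register, so the XOR with crc
 * and the XLEN-wide CLMUL see consistently ordered bits.
 */
unsigned long s;

s = (unsigned long)crc << 32;	/* crc occupies the top 32 bits */
s ^= __cpu_to_be64(*p_ul++);	/* byte-swap the loaded word */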

>
> The CRC I've done recently is the hdlc crc-16.
> My nios version (also mips-like) has:
>
> static __inline__ uint32_t
> crc_step(uint32_t crc, uint32_t byte_val)
> {
> #if defined(crc_step_ci)
> 	return crc_step_ci(byte_val, crc);
> #else
> 	uint32_t t = crc ^ (byte_val & 0xff);
> 	t = (t ^ t << 4) & 0xff;
> 	return crc >> 8 ^ t << 8 ^ t << 3 ^ t >> 4;
> #endif
> }
>
> I normally use a custom instruction for the logic - one clock.
> But the C code is only a couple of clocks slower than the best
> table lookup version.
> On anything pipelined and multi-issue the C code is likely to
> be faster than a lookup table!

I don't fully understand your code snippet, but it looks like only the least
significant byte of "uint32_t byte_val" goes into this crc16 calculation, so
I don't see how it can work.

BRs,
Xiao

> I don't know if any of the 32bit crc can be reduced the same way.
>
> David
>