RE: [PATCH] riscv: Optimize crc32 with Zbc extension

From: Wang, Xiao W
Date: Wed Jan 24 2024 - 00:45:14 EST


Hi Andrew,

> -----Original Message-----
> From: Andrew Jones <ajones@xxxxxxxxxxxxxxxx>
> Sent: Tuesday, January 16, 2024 11:04 PM
> To: Wang, Xiao W <xiao.w.wang@xxxxxxxxx>
> Cc: paul.walmsley@xxxxxxxxxx; palmer@xxxxxxxxxxx;
> aou@xxxxxxxxxxxxxxxxx; conor.dooley@xxxxxxxxxxxxx; heiko@xxxxxxxxx; Li,
> Haicheng <haicheng.li@xxxxxxxxx>; linux-riscv@xxxxxxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] riscv: Optimize crc32 with Zbc extension
>
> On Fri, Jan 05, 2024 at 04:08:30PM +0800, Xiao Wang wrote:
> > As suggested by the B-ext spec, the Zbc (carry-less multiplication)
> > instructions can be used to accelerate CRC calculations. Currently, the
> > crc32 is the most widely used crc function inside kernel, so this patch
> > focuses on the optimization of just the crc32 APIs.
> >
> > Compared with the current table-lookup based optimization, Zbc based
> > optimization can also achieve large stride during CRC calculation loop,
> > meantime, it avoids the memory access latency of the table-lookup based
> > implementation and it reduces memory footprint.
> >
> > If Zbc feature is not supported in a runtime environment, then the
> > table-lookup based implementation would serve as fallback via alternative
> > mechanism. To avoid the performance concern of unalignment access, we
> also
> > use the fallback implementation to handle the head and tail bytes if any,
> > the main body is accelerated by Zbc extension.
> >
> > This patch is tested on QEMU VM with the kernel CRC32 selftest.
>
> Ideally we'd also have the results of some benchmarks, or at least a
> dynamic instruction count or something, but I don't suspect that the
> table-lookup would "win" over Zbc. The bigger question is whether we
> want three implementations, because I assume we want Zvbc, particularly
> since [1] says
>
> """
> Zvbc is listed as a development option for use in other algorithms,
> and will become mandatory. Scalar Zbc is now listed as an expansion
> option, i.e., it will probably not become mandatory.
> """
>
> [1] https://github.com/riscv/riscv-profiles/blob/main/rva23-profile.adoc

I will collect the dynamic instruction count change and include the numbers in the commit log.
Regarding the bigger question, I have some thoughts:

- It looks like this question also applies to other cases such as lib/str*, which are currently
accelerated by the Zbb extension; now that in-kernel RVV support has been enabled, we could
also use RVV to accelerate lib/str*.
The ALTERNATIVE_2 mechanism allows a second code patching at the same location, so two
alternative versions of the same function can be supported (see the sketch after this list).

- At this moment, I can't see how Zvbc could better accelerate these crc32 APIs, which take
just a single data stream as input. I have a rough idea for a batched API that takes multiple
data streams as input (a parallel counterpart of the Zbc acceleration below), but not for the
three APIs here.
BTW, a vector "clmulr" insn is not listed in the Zvbc spec, although it would be useful for the
implementation below:
https://github.com/riscv/riscv-crypto/blob/main/doc/vector/riscv-crypto-vector-zvbc.adoc
Anyway, if we come up with a Zvbc based optimization in the future, then ALTERNATIVE_2
would help.
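
For reference, a minimal sketch of how a three-way dispatch could look with ALTERNATIVE_2.
This is only an illustration: the RISCV_ISA_EXT_ZVBC patch id and the "zvbc" label are
placeholders for a hypothetical future vector implementation.

        /*
         * Default to the table-lookup fallback; patch to the Zbc body when
         * Zbc is present, or jump to a (hypothetical) Zvbc body when Zvbc
         * is present.
         */
        asm_volatile_goto(ALTERNATIVE_2("j %l[legacy]",
                                        "nop", 0, RISCV_ISA_EXT_ZBC, 1,
                                        "j %l[zvbc]", 0, RISCV_ISA_EXT_ZVBC, 1)
                          : : : : legacy, zvbc);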

>
>
> A few nits below.

Thanks for all the comments above and below. I will adopt most of the suggestions, except for
two; please check below.

>
> >
> > Signed-off-by: Xiao Wang <xiao.w.wang@xxxxxxxxx>
> > ---
> > arch/riscv/Kconfig | 23 +++++
> > arch/riscv/lib/Makefile | 1 +
> > arch/riscv/lib/crc32.c | 216
> ++++++++++++++++++++++++++++++++++++++++
> > include/linux/crc32.h | 3 +
> > 4 files changed, 243 insertions(+)
> > create mode 100644 arch/riscv/lib/crc32.c
> >
[...]

> > +#if (BITS_PER_LONG == 64)
> > + size_t len, u32 poly, u64 poly_qt,
> > +#else
> > + size_t len, u32 poly, u32 poly_qt,
> > +#endif
>
> How about creating a new type for poly_qt, defined as u64 for xlen=64
> and u32 for xlen=32 to avoid the #ifdef?

I would make it "unsigned long", as David suggested in his comment.
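
That is, roughly the following prototype (storage class and attributes elided; "fallback" is
the function-pointer typedef already introduced by this patch):

        u32 crc32_le_generic(u32 crc, unsigned char const *p,
                             size_t len, u32 poly,
                             unsigned long poly_qt,
                             fallback crc_fb)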

>
> > + fallback crc_fb)
> > +{
> > + size_t offset, head_len, tail_len;
> > + const unsigned long *p_ul;
> > + unsigned long s;
> > +
> > + asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> > + RISCV_ISA_EXT_ZBC, 1)
> > + : : : : legacy);
> > +
> > + /* Handle the unalignment head. */
> > + offset = (unsigned long)p & OFFSET_MASK;
> > + if (offset) {
> > + head_len = MIN(STEP - offset, len);
> > + crc = crc_fb(crc, p, head_len);
> > + len -= head_len;
> > + p += head_len;
> > + }
> > +
> > + tail_len = len & OFFSET_MASK;
> > + len = len >> STEP_ORDER;
> > + p_ul = (unsigned long *)p;
> > +
> > + for (int i = 0; i < len; i++) {
> > +#if (BITS_PER_LONG == 64)
> > + s = (unsigned long)crc ^ __cpu_to_le64(*p_ul++);
> > + /* We don't have a "clmulrh" insn, so use clmul + slli instead.
> > + */
>
> need opening /* comment wing
>
> > + asm volatile (".option push\n"
> > + ".option arch,+zbc\n"
> > + "clmul %0, %1, %2\n"
> > + "slli %0, %0, 1\n"
> > + "xor %0, %0, %1\n"
> > + "clmulr %0, %0, %3\n"
> > + "srli %0, %0, 32\n"
> > + ".option pop\n"
> > + : "=&r" (crc)
> > + : "r" (s),
> > + "r" (poly_qt),
> > + "r" ((u64)poly << 32)
> > + :);
> > +#else
> > + s = crc ^ __cpu_to_le32(*p_ul++);
> > + /* We don't have a "clmulrh" insn, so use clmul + slli instead.
>
> also here
>
> > + */
> > + asm volatile (".option push\n"
> > + ".option arch,+zbc\n"
> > + "clmul %0, %1, %2\n"
> > + "slli %0, %0, 1\n"
> > + "xor %0, %0, %1\n"
> > + "clmulr %0, %0, %3\n"
> > + ".option pop\n"
> > + : "=&r" (crc)
> > + : "r" (s),
> > + "r" (poly_qt),
> > + "r" (poly)
> > + :);
> > +#endif
>
> Instead of the #ifdef here, we could provide function wrappers for the asm
> which would be defined above in the first #ifdef ladder.
>
> > + }
> > +
> > + /* Handle the tail bytes. */
> > + if (tail_len)
> > + crc = crc_fb(crc, (unsigned char const *)p_ul, tail_len);
> > + return crc;
> > +
> > +legacy:
> > + return crc_fb(crc, p, len);
> > +}
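
For the wrapper idea, I have something like the sketch below in mind (naming is tentative and
the final code may differ). The rv32 counterpart would mirror it without the final srli and
with the plain 32-bit poly as the last operand:

#if (BITS_PER_LONG == 64)
static inline unsigned long crc32_le_zbc_step(unsigned long s, u32 poly,
                                              unsigned long poly_qt)
{
        unsigned long crc;

        /* No "clmulrh" insn exists, so emulate it with clmul + slli. */
        asm volatile (".option push\n"
                      ".option arch,+zbc\n"
                      "clmul %0, %1, %2\n"
                      "slli %0, %0, 1\n"
                      "xor %0, %0, %1\n"
                      "clmulr %0, %0, %3\n"
                      "srli %0, %0, 32\n"
                      ".option pop\n"
                      : "=&r" (crc)
                      : "r" (s), "r" (poly_qt), "r" ((u64)poly << 32)
                      :);
        return crc;
}
#endif

With that, the loop body would just be "crc = crc32_le_zbc_step(s, poly, poly_qt);" with no
#ifdef around the asm.
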
> > +
> > +u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)
> > +{
> > + return crc32_le_generic(crc, p, len, CRC32_POLY_LE,
> CRC32_POLY_QT_LE,
> > + crc32_le_base);
> > +}
> > +
> > +u32 __pure __crc32c_le(u32 crc, unsigned char const *p, size_t len)
> > +{
> > + return crc32_le_generic(crc, p, len, CRC32C_POLY_LE,
> > + CRC32C_POLY_QT_LE, __crc32c_le_base);
> > +}
> > +
> > +u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
> > +{
> > + size_t offset, head_len, tail_len;
> > + const unsigned long *p_ul;
> > + unsigned long s;
> > +
> > + asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> > + RISCV_ISA_EXT_ZBC, 1)
> > + : : : : legacy);
> > +
> > + /* Handle the unalignment head. */
> > + offset = (unsigned long)p & OFFSET_MASK;
> > + if (offset) {
> > + head_len = MIN(STEP - offset, len);
> > + crc = crc32_be_base(crc, p, head_len);
> > + len -= head_len;
> > + p += head_len;
> > + }
> > +
> > + tail_len = len & OFFSET_MASK;
> > + len = len >> STEP_ORDER;
> > + p_ul = (unsigned long *)p;
> > +
> > + for (int i = 0; i < len; i++) {
> > +#if (BITS_PER_LONG == 64)
> > + s = (unsigned long)crc << 32;
> > + s ^= __cpu_to_be64(*p_ul++);
> > +#else
> > + s = crc ^ __cpu_to_be32(*p_ul++);
> > +#endif
>
> Could write the above without #ifdef with
>
> if (IS_ENABLED(CONFIG_64BIT)) {
> ...
> } else {
> ...
> }

I got a warning on the riscv32 build with this change (gcc v12.2.0):

./arch/riscv/lib/crc32.c:191:40: warning: left shift count >= width of type [-Wshift-count-overflow]

It looks like the compiler still parses and type-checks the statically dead branch, so the
64-bit-only shift trips the warning on rv32.
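
For reference, this is roughly the shape of the code that triggers it:

        if (IS_ENABLED(CONFIG_64BIT)) {
                /* Never taken on rv32, but still compiled, so the
                 * 32-bit-wide shift below warns there. */
                s = (unsigned long)crc << 32;
                s ^= __cpu_to_be64(*p_ul++);
        } else {
                s = crc ^ __cpu_to_be32(*p_ul++);
        }

That's why I'd keep the #ifdef for this spot.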

BRs,
Xiao