[PATCH] riscv: lib: Optimize 'strlen' function

From: Ivan Orlov
Date: Wed Dec 13 2023 - 10:46:25 EST


The current non-ZBB implementation of 'strlen' function iterates the
memory bytewise, looking for a zero byte. It could be optimized to use
the wordwise iteration instead, so we will process 4/8 bytes of memory
at a time.

Word could be tested for containing the zero byte in just 4 operations:

1. Subtract 0x0101..01 from the word. If there was a zero byte in the
word, the leading zero bit of this byte will turn to 1.
2. Calculate ~word.
3. And the subtraction result with ~word. That way we will know if some
bits were 0 and then turned to 1.
4. And the result of step 3 with 0x8080..80. 0x8080..90 filters leading
bits for every byte in the word, so we will know if one of them turned
from 0 to 1.

This patch modifies the riscv-specific strlen function to work in the
following way:

1. If the address is unaligned, iterate SZREG - (address % SZREG) bytes
to align it.
2. After the address is aligned, iterate words of memory and check them
for containing the zero byte using the algorithm described above.
3. When we found a word with a zero byte, iterate through the word
bytewise to get the exact position of the 0 and return the corresponding
string length.

Here you can find the benchmarking results for the VisionFive2 board
comparing the old and new implementations of the strlen function.

Size: 1 (+-0), mean_old: 673, mean_new: 666
Size: 2 (+-0), mean_old: 672, mean_new: 676
Size: 4 (+-0), mean_old: 685, mean_new: 659
Size: 8 (+-0), mean_old: 682, mean_new: 673
Size: 16 (+-0), mean_old: 718, mean_new: 694
Size: 32 (+-0), mean_old: 753, mean_new: 726
Size: 64 (+-3), mean_old: 835, mean_new: 773
Size: 128 (+-8), mean_old: 994, mean_new: 821
Size: 256 (+-4), mean_old: 1314, mean_new: 924
Size: 512 (+-38), mean_old: 1978, mean_new: 1105
Size: 1024 (+-40), mean_old: 3263, mean_new: 1542
Size: 2048 (+-54), mean_old: 5871, mean_new: 2352
Size: 4096 (+-155), mean_old: 11061, mean_new: 3972
Size: 8192 (+-542), mean_old: 21431, mean_new: 7214
Size: 16384 (+-43), mean_old: 42239, mean_new: 13722
Size: 32768 (+-2712), mean_old: 85674, mean_new: 28471
Size: 65536 (+-907), mean_old: 189219, mean_new: 74236
Size: 131072 (+-1343), mean_old: 377023, mean_new: 147130
Size: 262144 (+-411), mean_old: 752993, mean_new: 292799
Size: 524288 (+-12242), mean_old: 1504279, mean_new: 583952

The function was tested on different string lengths and random offsets
(to check how it works with unaligned addresses). The clocks count were
measured with ktime_get and the mean values are calculated from 1000
test runs for every string length.

You can notice that the new function is slightly faster for small string
lengths and 2-3 times faster for longer strings, therefore I believe
this change could be really useful.

Signed-off-by: Ivan Orlov <ivan.orlov0322@xxxxxxxxx>
---
arch/riscv/lib/strlen.S | 72 +++++++++++++++++++++++++++++++----------
1 file changed, 55 insertions(+), 17 deletions(-)

diff --git a/arch/riscv/lib/strlen.S b/arch/riscv/lib/strlen.S
index 8ae3064e45ff..76486dd3c07d 100644
--- a/arch/riscv/lib/strlen.S
+++ b/arch/riscv/lib/strlen.S
@@ -5,29 +5,67 @@
#include <asm/alternative-macros.h>
#include <asm/hwcap.h>

-/* int strlen(const char *s) */
-SYM_FUNC_START(strlen)
+#define REP_80 __REG_SEL(0x8080808080808080, 0x80808080)
+#define REP_01 __REG_SEL(0x0101010101010101, 0x01010101)

+/* size_t strlen(const char *s) */
+SYM_FUNC_START(strlen)
ALTERNATIVE("nop", "j strlen_zbb", 0, RISCV_ISA_EXT_ZBB, CONFIG_RISCV_ISA_ZBB)

/*
- * Returns
- * a0 - string length
- *
- * Parameters
- * a0 - String to measure
- *
- * Clobbers:
- * t0, t1
- */
- mv t1, a0
+ * Returns
+ * a0 - string length
+ *
+ * Parameters
+ * a0 - String to measure
+ *
+ * Clobbers:
+ * t0, t1, t2, t3, t4
+ */
+
+
+ mv t4, a0
+
+ /* Check the address memory alignment */
+ andi t2, a0, SZREG-1
+ beqz t2, 2f
+
+ /* Get SZREG - (address remainder) */
+ xori t2, t2, SZREG-1
+ addi t2, t2, 1
+
+ /* align the address */
+ add t2, a0, t2
1:
- lbu t0, 0(t1)
- beqz t0, 2f
- addi t1, t1, 1
- j 1b
+ beq a0, t2, 2f
+ lbu t1, 0(a0)
+ beqz t1, 5f
+ addi a0, a0, 1
+ j 1b
2:
- sub a0, t1, a0
+ li t2, REP_80
+ li t3, REP_01
+3:
+ /* v contains 0 byte if (v - 0x0101..01) & ~v & 0x8080..80 != 0 */
+ REG_L t0, 0(a0)
+ sub t1, t0, t3
+ not t0, t0
+ and t1, t1, t0
+ and t1, t1, t2
+ bnez t1, 4f
+ addi a0, a0, SZREG
+ j 3b
+4:
+ /*
+ * We found the word with 0, iterate it byte-wise to find the actual
+ * string length.
+ */
+ lbu t0, 0(a0)
+ beqz t0, 5f
+ addi a0, a0, 1
+ j 4b
+5:
+ sub a0, a0, t4
ret

/*
--
2.34.1