[PATCH 0/1] riscv: better network performance with memcpy, uaccess

From: Akira Tsukamoto
Date: Fri Jun 04 2021 - 05:53:47 EST


I am adding a cover letter to explain the history and details, since the
improvement comes from combining this patch with Gary's memcpy patch [1].

Below is a comparison of iperf3 benchmark results with Gary's memcpy patch
and my uaccess optimization patch applied. All results are from the same
base kernel, same rootfs and same BeagleV beta board.

First (left) column : BeagleV 5.13-rc4 kernel [2]
Second column       : Added Palmer's memcpy in C + my uaccess patch [3]
Third column        : Added Gary's memcpy + my uaccess patch [4]

--- TCP recv ---
686 Mbits/sec | 700 Mbits/sec | 904 Mbits/sec
683 Mbits/sec | 701 Mbits/sec | 898 Mbits/sec
695 Mbits/sec | 702 Mbits/sec | 905 Mbits/sec

--- TCP send ---
383 Mbits/sec | 390 Mbits/sec | 393 Mbits/sec
384 Mbits/sec | 393 Mbits/sec | 392 Mbits/sec

--- UDP send ---
307 Mbits/sec | 358 Mbits/sec | 402 Mbits/sec
307 Mbits/sec | 359 Mbits/sec | 402 Mbits/sec

--- UDP recv ---
630 Mbits/sec | 799 Mbits/sec | 875 Mbits/sec
730 Mbits/sec | 796 Mbits/sec | 873 Mbits/sec


The uaccess patch reduces pipeline stalls caused by read-after-write (RAW)
hazards by unrolling the loads and stores.
The main reason for using assembler in uaccess.S is that page faults in
__asm_to/copy_from_user() must be handled manually inside the functions.
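
To illustrate the unrolling idea only, here is a minimal sketch in C. It
is not the code from the patch: the real routine is written in assembler
so that it can handle page faults, and a C compiler is free to reschedule
these accesses. The name copy_words_unrolled is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustration only: copy n_words 64-bit words.
 * Issuing several independent loads before the stores means no store
 * has to wait on the load immediately in front of it.
 */
void copy_words_unrolled(uint64_t *dst, const uint64_t *src, size_t n_words)
{
    size_t i = 0;

    /* Main loop: four loads first, then four stores. */
    for (; i + 4 <= n_words; i += 4) {
        uint64_t a = src[i];
        uint64_t b = src[i + 1];
        uint64_t c = src[i + 2];
        uint64_t d = src[i + 3];

        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }

    /* Tail: the remaining 0..3 words, one at a time. */
    for (; i < n_words; i++)
        dst[i] = src[i];
}
```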

The above results combine Gary's memcpy, which speeds things up by
reducing S-mode/M-mode switching, with my uaccess patch, which reduces
pipeline stalls when user space issues syscalls with large data.

We had a discussion with Palmer about improving network performance on
the BeagleV beta board.

Palmer suggested using C-based string routines, which check for unaligned
addresses and use an 8-byte aligned copy if both src and dest are aligned,
and otherwise fall back to the current copy function.
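
A rough sketch of that approach, as I understand it, is below. It is not
Palmer's actual code. The names memcpy_aligned_sketch and copy_bytes are
hypothetical, and copy_bytes only stands in for the current copy function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the existing (byte-wise) copy routine. */
static void copy_bytes(unsigned char *d, const unsigned char *s, size_t n)
{
    while (n--)
        *d++ = *s++;
}

/*
 * If both pointers are 8-byte aligned, copy 64 bits at a time and leave
 * only the tail to copy_bytes(); otherwise copy_bytes() does everything.
 * The pointer casts assume the kernel's -fno-strict-aliasing build.
 */
void *memcpy_aligned_sketch(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if ((((uintptr_t)d | (uintptr_t)s) & 7) == 0) {
        uint64_t *d8 = (uint64_t *)d;
        const uint64_t *s8 = (const uint64_t *)s;

        for (; n >= 8; n -= 8)
            *d8++ = *s8++;

        d = (unsigned char *)d8;
        s = (const unsigned char *)s8;
    }

    copy_bytes(d, s, n);
    return dest;
}
```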

Gary's assembly version of memcpy improves on this by avoiding unaligned
accesses across 64-bit boundaries: it reads with aligned accesses and
shifts the data by the offset, because every misaligned access is trapped
and switches to OpenSBI in M-mode. The main speedup comes from avoiding
switching between S-mode (kernel) and M-mode (OpenSBI).
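
The general shift-and-merge technique can be sketched in C as follows.
This is not Gary's code, which is in assembly. The name copy_shift_merge
is hypothetical, and the sketch assumes a little-endian machine, an
aligned destination, and that the aligned words containing the first and
last source bytes are readable.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy n_words 64-bit words from a possibly misaligned src to an aligned
 * dst using only aligned 64-bit loads. Each output word is built from two
 * neighbouring aligned source words with shifts (little-endian layout).
 */
void copy_shift_merge(uint64_t *dst, const unsigned char *src, size_t n_words)
{
    unsigned int off = (unsigned int)((uintptr_t)src & 7); /* bytes past alignment */
    const uint64_t *s = (const uint64_t *)(src - off);     /* round down to 8 bytes */

    if (off == 0) {
        /* Already aligned: plain word copy. */
        for (size_t i = 0; i < n_words; i++)
            dst[i] = s[i];
        return;
    }

    if (n_words == 0)
        return;

    unsigned int lo_shift = off * 8;      /* 8..56 bits */
    unsigned int hi_shift = 64 - lo_shift;
    uint64_t cur = s[0];

    for (size_t i = 0; i < n_words; i++) {
        uint64_t next = s[i + 1];

        /* Upper bytes of cur become the low bytes of the output word. */
        dst[i] = (cur >> lo_shift) | (next << hi_shift);
        cur = next;
    }
}
```

Each aligned source word is loaded once and reused for two output words,
so no load is misaligned and nothing traps.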

Processing network packets requires a lot of unaligned accesses for the
packet headers, since the header format cannot be redesigned to be
aligned.
In addition, user applications pass large packet data to send()/recv() and
sendto()/recvfrom() so that fewer function calls are needed to read and
write the data.

Akira

[1] https://lkml.org/lkml/2021/2/16/778
[2] https://github.com/mcd500/linux-jh7100/tree/starlight-sdimproved
[3] https://github.com/mcd500/linux-jh7100/tree/starlight-sd-palmer-string
[4] https://github.com/mcd500/linux-jh7100/tree/starlight-sd-gary

Akira Tsukamoto (1):
riscv: prevent pipeline stall in __asm_to/copy_from_user

arch/riscv/lib/uaccess.S | 106 +++++++++++++++++++++++++++------------
1 file changed, 73 insertions(+), 33 deletions(-)

--
2.17.1