Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention

From: Mirsad Todorovac
Date: Sun Nov 05 2023 - 14:27:56 EST


On 11/5/23 16:33, Andrew Lunn wrote:

With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
a lock storm that will stall all of the cores and CPUs on the same memory controller
for certain time I/O takes to finish.

Please provide benchmark data to show this is a real issue, and the
patch fixes it.

Certainly, Sir, but I have to figure out what to measure.

To better think of it, I actually do not have a system with a physical RTL8411b, this patch
was made by the finding in the visual inspection of the code.

FWIW, separation of code and data is the design principle that is strongly endorsed lately
and it seems like a good practice that prevents a range of security breaches and attacks:

[1] https://blog.klipse.tech/databook/2020/10/02/separate-code-data.html

[2] Reduce system complexity by separating Code from Data,
https://livebook.manning.com/book/data-oriented-programming/chapter-2/v-2/

In this case, data is hard-coded.

The resulting code would be smaller in size and execution time, and IMHO more readable,
(in a table), but I will not enter much discussion for you have made your mind already.

Additionally, I would like to "inline" many functions, as I think that call/return
sequences with stack frame generation /destruction are more expensive than inlining the
small one liners.

Please provide benchmarks to show the compiler is getting this wrong,
and inline really is needed.

I think I am by now technologically up to that:

"drivers/net/ethernet/realtek/r8169_main.s" 302034 lines
-------------------------------------------------------------------------------------
7500 r8168_mac_ocp_write:
7501 .LVL488:
7502 .LFB6322:
7503 .loc 1 900 1 is_stmt 1 view -0
7504 .cfi_startproc
7505 1: call __fentry__
7506 .section __mcount_loc, "a",@progbits
7507 .quad 1b
7508 .previous
7509 .loc 1 901 2 view .LVU1955
7510 .loc 1 903 2 view .LVU1956
7511 .loc 1 903 7 view .LVU1957
7512 .loc 1 903 10 view .LVU1958
7513 .loc 1 903 33 view .LVU1959
7514 .loc 1 903 57 view .LVU1960
7515 .loc 1 903 88 view .LVU1961
7516 .loc 1 903 95 view .LVU1962
7517 .loc 1 900 1 is_stmt 0 view .LVU1963
7518 pushq %rbp
7519 .cfi_def_cfa_offset 16
7520 .cfi_offset 6, -16
7521 movq %rsp, %rbp
7522 .cfi_def_cfa_register 6
7523 pushq %r15
7524 .cfi_offset 15, -24
7525 .loc 1 903 103 view .LVU1964
7526 leaq 6692(%rdi), %r15
7527 .loc 1 900 1 view .LVU1965
7528 pushq %r14
7529 .cfi_offset 14, -32
7530 movl %edx, %r14d
7531 pushq %r13
7532 pushq %r12
7533 .cfi_offset 13, -40
7534 .cfi_offset 12, -48
7535 movq %rdi, %r12
7536 .loc 1 903 103 view .LVU1966
7537 movq %r15, %rdi
7538 .LVL489:
7539 .loc 1 900 1 view .LVU1967
7540 pushq %rbx
7541 .cfi_offset 3, -56
7542 .loc 1 900 1 view .LVU1968
7543 movl %esi, %ebx
7544 .loc 1 903 103 view .LVU1969
7545 call _raw_spin_lock_irqsave
7546 .LVL490:
7547 .LBB3557:
7548 .LBB3558:
7549 .loc 1 893 6 view .LVU1970
7550 movl %ebx, %edi
7551 .LBE3558:
7552 .LBE3557:
7553 .loc 1 903 103 view .LVU1971
7554 movq %rax, %r13
7555 .LVL491:
7556 .loc 1 903 5 is_stmt 1 view .LVU1972
7557 .loc 1 904 2 view .LVU1973
7558 .LBB3564:
7559 .LBI3557:
7560 .loc 1 891 13 view .LVU1974
7561 .LBB3563:
7562 .loc 1 893 2 view .LVU1975
7563 .loc 1 893 6 is_stmt 0 view .LVU1976
7564 call rtl_ocp_reg_failure
7565 .LVL492:
7566 .loc 1 893 5 view .LVU1977
7567 testb %al, %al
7568 jne .L375
7569 .LVL493:
7570 .LBB3559:
7571 .LBI3559:
7572 .loc 1 891 13 is_stmt 1 view .LVU1978
7573 .LBB3560:
7574 .loc 1 896 2 view .LVU1979
7575 .loc 1 896 28 is_stmt 0 view .LVU1980
7576 sall $15, %ebx
7577 .LVL494:
7578 .loc 1 896 58 view .LVU1981
7579 movq (%r12), %rax
7580 .loc 1 896 35 view .LVU1982
7581 orl %r14d, %ebx
7582 .loc 1 896 2 view .LVU1983
7583 orl $-2147483648, %ebx
7584 .LVL495:
7585 .LBB3561:
7586 .LBI3561:
7587 .loc 2 66 120 is_stmt 1 view .LVU1984
7588 .LBB3562:
7589 .loc 2 66 168 view .LVU1985
7590 #APP
7591 # 66 "./arch/x86/include/asm/io.h" 1
7592 movl %ebx,176(%rax)
7593 # 0 "" 2
7594 .LVL496:
7595 #NO_APP
7596 .L375:
7597 .loc 2 66 168 is_stmt 0 view .LVU1986
7598 .LBE3562:
7599 .LBE3561:
7600 .LBE3560:
7601 .LBE3559:
7602 .LBE3563:
7603 .LBE3564:
7604 .loc 1 905 2 is_stmt 1 view .LVU1987
7605 .loc 1 905 7 view .LVU1988
7606 .loc 1 905 10 view .LVU1989
7607 .loc 1 905 33 view .LVU1990
7608 .loc 1 905 57 view .LVU1991
7609 .loc 1 905 88 view .LVU1992
7610 .loc 1 905 95 view .LVU1993
7611 movq %r13, %rsi
7612 movq %r15, %rdi
7613 call _raw_spin_unlock_irqrestore
7614 .LVL497:
7615 .loc 1 905 5 view .LVU1994
7616 .loc 1 906 1 is_stmt 0 view .LVU1995
7617 popq %rbx
7618 .cfi_restore 3
7619 popq %r12
7620 .cfi_restore 12
7621 .LVL498:
7622 .loc 1 906 1 view .LVU1996
7623 popq %r13
7624 .cfi_restore 13
7625 .LVL499:
7626 .loc 1 906 1 view .LVU1997
7627 popq %r14
7628 .cfi_restore 14
7629 .LVL500:
7630 .loc 1 906 1 view .LVU1998
7631 popq %r15
7632 .cfi_restore 15
7633 .LVL501:
7634 .loc 1 906 1 view .LVU1999
7635 popq %rbp
7636 .cfi_restore 6
7637 .cfi_def_cfa 7, 8
7638 xorl %eax, %eax
7639 xorl %edx, %edx
7640 xorl %esi, %esi
7641 xorl %edi, %edi
7642 jmp __x86_return_thunk
7643 .cfi_endproc
7644 .LFE6322:
7645 .size r8168_mac_ocp_write, .-r8168_mac_ocp_write
7646 .p2align 4
7647 .section __patchable_function_entries,"awo",@progbits,rtl_eriar_cond_check
7648 .align 8
7649 .quad .LPFE44
7650 .text
7651 .LPFE44:
7652 nop
7653 nop
7654 nop
7655 nop
-------------------------------------------------------------------------------------

The call of the function is the actual call:

-------------------------------------------------------------------------------------
39334 .LBE11119:
39335 .loc 1 3112 2 is_stmt 1 view .LVU10399
39336 xorl %edx, %edx
39337 movl $64556, %esi
39338 movq %r13, %rdi
39339 call r8168_mac_ocp_write
-------------------------------------------------------------------------------------

The command used for generating the assembly was taken from .o.cmd file and
added -save-temps as the only change:

$ gcc -Wp,-MMD,drivers/net/ethernet/realtek/.r8169_main.o.d -save-temps -nostdinc \
-I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi \
-I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi \
-include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h \
-include ./include/linux/compiler_types.h -D__KERNEL__ -fmacro-prefix-map=./= \
-std=gnu11 -fshort-wchar -funsigned-char -fno-common -fno-PIE -fno-strict-aliasing \
-mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m64 -falign-jumps=1 \
-falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup \
-mtune=generic -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables \
-mindirect-branch=thunk-extern -mindirect-branch-register -mindirect-branch-cs-prefix \
-mfunction-return=thunk-extern -fno-jump-tables -mharden-sls=all -fpatchable-function-entry=16,16 \
-fno-delete-null-pointer-checks -O2 -fno-allow-store-data-races -fstack-protector-strong \
-fno-omit-frame-pointer -fno-optimize-sibling-calls -fno-stack-clash-protection \
-fzero-call-used-regs=used-gpr -pg -mrecord-mcount -mfentry -DCC_USING_FENTRY -falign-functions=16 \
-fno-strict-overflow -fno-stack-check -fconserve-stack -Wall -Wundef -Werror=implicit-function-declaration \
-Werror=implicit-int -Werror=return-type -Werror=strict-prototypes -Wno-format-security -Wno-trigraphs \
-Wno-frame-address -Wno-address-of-packed-member -Wframe-larger-than=1024 -Wno-main \
-Wno-unused-but-set-variable -Wno-unused-const-variable -Wvla -Wno-pointer-sign -Wcast-function-type \
-Wno-array-bounds -Wno-alloc-size-larger-than -Wimplicit-fallthrough=5 -Werror=date-time \
-Werror=incompatible-pointer-types -Werror=designated-init -Wenum-conversion -Wno-unused-but-set-variable \
-Wno-unused-const-variable -Wno-restrict -Wno-packed-not-aligned -Wno-format-overflow -Wno-format-truncation \
-Wno-stringop-overflow -Wno-stringop-truncation -Wno-missing-field-initializers -Wno-type-limits \
-Wno-shift-negative-value -Wno-maybe-uninitialized -Wno-sign-compare -g -gdwarf-5 -fsanitize=bounds-strict \
-fsanitize=shift -fsanitize=bool -fsanitize=enum -DMODULE -DKBUILD_BASENAME='"r8169_main"' \
-DKBUILD_MODNAME='"r8169"' -D__KBUILD_MODNAME=kmod_r8169 -c -o drivers/net/ethernet/realtek/r8169_main.o \
drivers/net/ethernet/realtek/r8169_main.c ; ./tools/objtool/objtool --hacks=jump_label --hacks=noinstr \
--hacks=skylake --retpoline --rethunk --sls --stackval --static-call --uaccess --prefix=16 \
--module drivers/net/ethernet/realtek/r8169_main.o

This is a build against net-next. Please find the attached config.

RTL_(R|W)(8|16|32) family are obviously macros:

#define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))

This is the writel():

7590 #APP
7591 # 66 "./arch/x86/include/asm/io.h" 1
7592 movl %ebx,176(%rax)
7593 # 0 "" 2
7594 .LVL496:
7595 #NO_APP

writel() looks optimal.

Hope this helps.

Until there are benchmarks: NACK.

That means I've got a Reviewed-by: Jacob Keller and two NACKs.

I am voted out. :-/

I suppose one NACK from a maintainer is sufficient to halt the patch.

Going back to the documentation, the drawing board, and of course, the Source. :-)

Best regards,
Mirsad Todorovac

Attachment: net-next-config.xz
Description: application/xz