Re: [PATCH bpf-next v4 0/3] bpf, arm64: use BPF prog pack allocator in BPF JIT

From: Florent Revest
Date: Fri Jun 30 2023 - 13:21:04 EST


On Mon, Jun 26, 2023 at 10:58 AM Puranjay Mohan <puranjay12@xxxxxxxxx> wrote:
>
> BPF programs currently consume a page each on ARM64. For systems with many BPF
> programs, this adds significant pressure to the instruction TLB. High iTLB
> pressure usually slows down the whole system.
>
> Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue.
> It packs multiple BPF programs into a single huge page. It is currently only
> enabled for the x86_64 BPF JIT.
>
> This patch series enables the BPF prog pack allocator for the ARM64 BPF JIT.
>
> ====================================================
> Performance Analysis of prog pack allocator on ARM64
> ====================================================
>
> To test the performance of the BPF prog pack allocator on ARM64, a stresser
> tool[2] was built. This tool loads 8 BPF programs on the system and triggers
> 5 of them in an infinite loop by doing system calls.
>
> The runner script starts 20 instances of the above, which load 8*20=160 BPF
> programs on the system, 5*20=100 of which are being constantly triggered.
>
> In the above environment we build Python-3.8.4 and collect iTLB metrics for
> the compilation done by gcc-12.2.0.
>
> The source code[3] is configured with the following command:
> ./configure --enable-optimizations --with-ensurepip=install
>
> Then the runner script is executed with the following command:
> ./run.sh "perf stat -e ITLB_WALK,L1I_TLB,INST_RETIRED,iTLB-load-misses -a make -j32"
>
> This builds Python while 160 BPF programs are loaded and 100 are being constantly
> triggered and measures iTLB related metrics.
>
> The output of the above command is discussed below before and after enabling the
> BPF prog pack allocator.
>
> The tests were run on qemu-system-aarch64 with 32 cpus, 4G memory, -machine virt,
> -cpu host, and -enable-kvm.
>
> Results
> -------
>
> Before enabling prog pack allocator:
> ------------------------------------
>
> Performance counter stats for 'system wide':
>
> 333278635 ITLB_WALK
> 6762692976558 L1I_TLB
> 25359571423901 INST_RETIRED
> 15824054789 iTLB-load-misses
>
> 189.029769053 seconds time elapsed
>
> After enabling prog pack allocator:
> -----------------------------------
>
> Performance counter stats for 'system wide':
>
> 190333544 ITLB_WALK
> 6712712386528 L1I_TLB
> 25278233304411 INST_RETIRED
> 5716757866 iTLB-load-misses
>
> 185.392650561 seconds time elapsed
>
> Improvements in metrics
> -----------------------
>
> Compilation time ---> 1.92% faster
> iTLB-load-misses/Sec (Less is better) ---> 63.16% decrease
> ITLB_WALK/1000 INST_RETIRED (Less is better) ---> 42.71% decrease
> ITLB_WALK/L1I_TLB (Less is better) ---> 42.47% decrease
>
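
For anyone who wants to re-derive the percentages above from the raw
counters, here is a minimal sketch. It assumes each improvement is computed
as (before - after) / before, which the cover letter does not state
explicitly. The helper names derived() and improvements() are made up for
illustration and are not part of the benchmark repository.

```python
# Counter values copied verbatim from the two perf stat outputs quoted above.
BEFORE = {
    "ITLB_WALK": 333278635,
    "L1I_TLB": 6762692976558,
    "INST_RETIRED": 25359571423901,
    "iTLB-load-misses": 15824054789,
    "seconds": 189.029769053,
}
AFTER = {
    "ITLB_WALK": 190333544,
    "L1I_TLB": 6712712386528,
    "INST_RETIRED": 25278233304411,
    "iTLB-load-misses": 5716757866,
    "seconds": 185.392650561,
}


def derived(run):
    """Derived metrics for one run (hypothetical helper, names follow the table)."""
    return {
        "Compilation time": run["seconds"],
        "iTLB-load-misses/Sec": run["iTLB-load-misses"] / run["seconds"],
        "ITLB_WALK/1000 INST_RETIRED": 1000 * run["ITLB_WALK"] / run["INST_RETIRED"],
        "ITLB_WALK/L1I_TLB": run["ITLB_WALK"] / run["L1I_TLB"],
    }


def improvements(before, after):
    """Percentage decrease of each derived metric (assumed formula)."""
    b, a = derived(before), derived(after)
    return {name: (b[name] - a[name]) / b[name] * 100.0 for name in b}


for name, pct in improvements(BEFORE, AFTER).items():
    print(f"{name}: {pct:.2f}% lower")
```

Under that assumption the results round to the four percentages listed
above.
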
> [1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@xxxxxxxxxx/
> [2] https://github.com/puranjaymohan/BPF-Allocator-Bench
> [3] https://www.python.org/ftp/python/3.8.4/Python-3.8.4.tgz
>
> Changes in V3 => V4: Changes only in 3rd patch
> 1. Fix the I-cache maintenance: Clean the data cache and invalidate the i-Cache
> only *after* the instructions have been copied to the ROX region.
>
> Changes in V2 => V3: Changes only in 3rd patch
> 1. Set prog = orig_prog; in the failure path of bpf_jit_binary_pack_finalize()
> call.
> 2. Add comments explaining the usage of the offsets in the exception table.
>
> Changes in v1 => v2:
> 1. Make the naming consistent in the 3rd patch:
> ro_image and image
> ro_header and header
> ro_image_ptr and image_ptr
> 2. Use names dst/src in place of addr/opcode in second patch.
> 3. Add Acked-by: Song Liu <song@xxxxxxxxxx> in 1st and 2nd patch.
>
> Puranjay Mohan (3):
> bpf: make bpf_prog_pack allocator portable
> arm64: patching: Add aarch64_insn_copy()
> bpf, arm64: use bpf_jit_binary_pack_alloc
>
> arch/arm64/include/asm/patching.h | 1 +
> arch/arm64/kernel/patching.c | 39 ++++++++
> arch/arm64/net/bpf_jit_comp.c | 145 +++++++++++++++++++++++++-----
> kernel/bpf/core.c | 8 +-
> 4 files changed, 165 insertions(+), 28 deletions(-)
>
> --
> 2.40.1
>
>

FWIW

Acked-by: Florent Revest <revest@xxxxxxxxxxxx>

Thanks for this Puranjay!