Re: [PATCH v3 0/2] execve scalability issues, part 1

From: Dennis Zhou
Date: Fri Aug 25 2023 - 11:15:10 EST


Hello,

On Wed, Aug 23, 2023 at 07:06:07AM +0200, Mateusz Guzik wrote:
> To start I figured I'm going to bench about as friendly case as it gets
> -- statically linked *separate* binaries all doing execve in a loop.
>
> I borrowed the bench from here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
>
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
>
> It prints a result every second.
>
> My test box is temporarily only 26 cores and even at this scale I run
> into massive lock contention stemming from back-to-back calls to
> percpu_counter_init (and _destroy later).
>
> While not a panacea, one simple thing to do here is to batch these ops.
> Since the term "batching" is already used in the file, I decided to
> refer to it as "grouping" instead.
>
> Even if this code could be patched to dodge these counters, I would
> argue a high-traffic alloc/free consumer is only a matter of time so it
> makes sense to facilitate it.
>
> With the fix I get an ok win, to quote from the commit:
> > Even at a very modest scale of 26 cores (ops/s):
> > before: 133543.63
> > after: 186061.81 (+39%)
>
> While with the patch these allocations remain a significant problem,
> the primary bottleneck shifts to:
>
> __pv_queued_spin_lock_slowpath+1
> _raw_spin_lock_irqsave+57
> folio_lruvec_lock_irqsave+91
> release_pages+590
> tlb_batch_pages_flush+61
> tlb_finish_mmu+101
> exit_mmap+327
> __mmput+61
> begin_new_exec+1245
> load_elf_binary+712
> bprm_execve+644
> do_execveat_common.isra.0+429
> __x64_sys_execve+50
> do_syscall_64+46
> entry_SYSCALL_64_after_hwframe+110
>
> I intend to do more work on the area to mostly sort it out, but I would
> not mind if someone else took the hammer to folio. :)
>
> With this out of the way I'll be looking at some form of caching to
> eliminate these allocs as a problem.
>
> v3:
> - fix !CONFIG_SMP build
> - drop the backtrace from fork commit message
>
> v2:
> - force bigger alignment on alloc
> - rename "counters" to "nr_counters" and pass prior to lock key
> - drop {}'s for single-statement loops
>
>
> Mateusz Guzik (2):
> pcpcntr: add group allocation/free
> fork: group allocation of per-cpu counters for mm struct
>
> include/linux/percpu_counter.h | 39 ++++++++++++++++++----
> kernel/fork.c | 14 ++------
> lib/percpu_counter.c | 61 +++++++++++++++++++++++-----------
> 3 files changed, 77 insertions(+), 37 deletions(-)
>
> --
> 2.41.0
>

I've applied both to percpu#for-6.6.

Thanks,
Dennis