Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset

From: Jann Horn
Date: Fri May 21 2021 - 19:09:24 EST


On Fri, May 21, 2021 at 9:14 PM Peter Oskolkov <posk@xxxxxxxxxx> wrote:
> On Fri, May 21, 2021 at 11:44 AM Andrei Vagin <avagin@xxxxxxxxxx> wrote:
> > On Thu, May 20, 2021 at 11:36 AM Peter Oskolkov <posk@xxxxxxxxxx> wrote:
> >>
> >> As indicated earlier in the FUTEX_SWAP patchset:
> >>
> >> https://lore.kernel.org/lkml/20200722234538.166697-1-posk@xxxxxxx/
> >
> >
> > Hi Peter,
> >
> > Do you have benchmark results? How fast is it compared with futex_swap and the google switchto?
>
> Hi Andrei,
>
> I did not run benchmarks on the same machine/kernel, but umcg_swap
> between "core" tasks (your use case for gVisor) should be somewhat
> faster than futex_swap, as there is no reading from the userspace and
> no futex hash lookup/dequeue ops;

The futex code currently creates and destroys hash table elements on
wait/wake, which does involve locking. You could probably avoid that
by building a faster futex variant optimized for the single-waiter
case, one that spends a bit more kernel memory to keep a persistent
hash table element (with RCU freeing) around for each pre-registered
lock address. Whether that'd be significantly faster, I don't know.


(As a sidenote, the futex code could slow down if the number of futex
buckets isn't well-calibrated, meaning you have something like >200
distinct futex addresses per CPU core (see futex_init()). In that
case futex_init() probably needs to be tuned a bit. Actually, on my work
laptop, this is what I see right now (not counting multiple waiters on
the same address in the same process, since they intentionally occupy
the same bucket):

# for tasks_dir in /proc/*/task; do cat $tasks_dir/*/syscall | grep '^202 ' | cut -d' ' -f2 | sort | uniq; done | wc -l
1193
# cat /sys/devices/system/cpu/possible
0-3
# gdb -core=/proc/kcore -ex "print ((unsigned long *)(0x$(grep __futex_data /proc/kallsyms | cut -d' ' -f1)))[1]" -batch
[...]
$1 = 1024

So the load factor of the futex hash table on this machine right now
is ~117%, which I think is quite a bit higher than you'd normally want
in a hash table? I don't know how representative that is though. Seems
to mostly come from the tons of Chrome processes.)
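
For reference, the arithmetic behind those numbers can be sketched as
follows. This is only a rough sketch, assuming the sizing that
futex_init() uses when CONFIG_BASE_SMALL is not set, i.e.
roundup_pow_of_two(256 * num_possible_cpus()). The Python helper names
are hypothetical and merely mirror that calculation; the inputs are the
values from the output above.

```python
def roundup_pow_of_two(n: int) -> int:
    """Smallest power of two >= n (for n >= 1), like the kernel helper."""
    return 1 << (n - 1).bit_length()


def futex_hash_buckets(possible_cpus: int, per_cpu: int = 256) -> int:
    """Bucket count, assuming the non-CONFIG_BASE_SMALL sizing rule."""
    return roundup_pow_of_two(per_cpu * possible_cpus)


def load_factor(distinct_addrs: int, buckets: int) -> float:
    """Average number of distinct futex addresses per bucket."""
    return distinct_addrs / buckets


# Values taken from the shell session above: CPUs 0-3 are possible,
# and 1193 distinct (process, futex address) pairs were being waited on.
buckets = futex_hash_buckets(4)
print(buckets)  # 1024, matching the value read via gdb
print(f"{load_factor(1193, buckets):.1%}")
```

With 4 possible CPUs this gives the same 1024 buckets that gdb
reported, and 1193 addresses over 1024 buckets is where the ~117%
comes from.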