Re: [PATCH bpf-next v2] bpf: Allow compiler to inline most of bpf_local_storage_lookup()

From: Yonghong Song
Date: Thu Feb 08 2024 - 19:18:53 EST



On 2/8/24 2:54 AM, Marco Elver wrote:
On Thu, Feb 08, 2024 at 08:37AM +0100, Marco Elver wrote:
On Thu, 8 Feb 2024 at 00:58, Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
On 2/7/24 4:26 AM, Marco Elver wrote:
In various performance profiles of kernels with BPF programs attached,
bpf_local_storage_lookup() appears as a significant portion of CPU
cycles spent. To enable the compiler to generate more optimal code, turn
bpf_local_storage_lookup() into a static inline function, where only the
cache insertion code path is outlined.

Notably, outlining cache insertion helps avoid bloating callers by
duplicating the setup of calls to raw_spin_lock_irqsave()/
raw_spin_unlock_irqrestore() (on architectures which do not inline
spin_lock/unlock, such as x86), which would cause the compiler to produce
worse code by deciding to outline otherwise inlinable functions. The call
overhead is neutral, because we make 2 calls either way: either calling
raw_spin_lock_irqsave() and raw_spin_unlock_irqrestore(); or calling
__bpf_local_storage_insert_cache(), which calls raw_spin_lock_irqsave(),
followed by a tail-call to raw_spin_unlock_irqrestore(), where the
compiler can perform TCO and (in optimized uninstrumented builds) turn it
into a plain jump. The call to __bpf_local_storage_insert_cache() can be
elided entirely if cacheit_lockit is a false constant expression.
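
For illustration, the outlined path looks roughly like the following (a
minimal sketch based on the hunks below, not the verbatim patch; the
comments are illustrative):

void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
				      struct bpf_local_storage_map *smap,
				      struct bpf_local_storage_elem *selem)
{
	unsigned long flags;

	/* Take the lock to avoid racing with a concurrent delete:
	 * publishing an already-deleted sdata to the cache would be a
	 * use-after-free on the next bpf_local_storage_lookup().
	 */
	raw_spin_lock_irqsave(&local_storage->lock, flags);
	if (selem_linked_to_storage(selem))
		rcu_assign_pointer(local_storage->cache[smap->cache_idx],
				   SDATA(selem));
	/* Tail call: in optimized builds this becomes a plain jump. */
	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
}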

Based on results from './benchs/run_bench_local_storage.sh' (21 trials,
reboot between each trial; x86 defconfig + BPF, clang 16), this produces
improvements in throughput and latency in the majority of cases, with an
average (geomean) improvement of 8%:
[...]
include/linux/bpf_local_storage.h | 30 ++++++++++-
kernel/bpf/bpf_local_storage.c | 52 +++++--------------
.../bpf/prog_tests/task_local_storage.c | 6 ---
.../selftests/bpf/progs/cgrp_ls_recursion.c | 26 ----------
.../selftests/bpf/progs/task_ls_recursion.c | 17 ------
5 files changed, 41 insertions(+), 90 deletions(-)

diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index 173ec7f43ed1..dcddb0aef7d8 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -129,10 +129,36 @@ bpf_local_storage_map_alloc(union bpf_attr *attr,
struct bpf_local_storage_cache *cache,
bool bpf_ma);

-struct bpf_local_storage_data *
+void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
+ struct bpf_local_storage_map *smap,
+ struct bpf_local_storage_elem *selem);
+/* If cacheit_lockit is false, this lookup function is lockless */
+static inline struct bpf_local_storage_data *
bpf_local_storage_lookup(struct bpf_local_storage *local_storage,
struct bpf_local_storage_map *smap,
- bool cacheit_lockit);
+ bool cacheit_lockit)
+{
+ struct bpf_local_storage_data *sdata;
+ struct bpf_local_storage_elem *selem;
+
+ /* Fast path (cache hit) */
+ sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx],
+ bpf_rcu_lock_held());
+ if (sdata && rcu_access_pointer(sdata->smap) == smap)
+ return sdata;
I think we should focus on the fast path (your v1 patch), and I suppose
most production environments want to hit the fast path most of the time.
In your production environment, did you see more than 16 local storage
maps per entity (task/sk/inode)?
I think having more than 16 local storage maps isn't entirely unlikely
as eBPF usage grows. But at the moment, it should be rare.

In the fast path, the memory accesses are two for
local_storage->cache[smap->cache_idx] (loading smap->cache_idx, then the
cache slot) and one for sdata->smap.


+
+ /* Slow path (cache miss) */
+ hlist_for_each_entry_rcu(selem, &local_storage->list, snode,
+ rcu_read_lock_trace_held())
+ if (rcu_access_pointer(SDATA(selem)->smap) == smap)
+ break;
But if we reach the slow path here, which means we have more than 16
local storage maps, then traversing the list and reading
SDATA(selem)->smap will be very expensive, in addition to the memory
accesses in the fast path.

I suppose here we mostly care about socket local storage, since it is
totally possible for a production workload to have millions of sockets.
To improve performance, the fast path should hit in most cases.
If there are too many sk local storage maps, some kind of sharing
could be done so that multiple applications use a single sk local
storage map.
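
For instance, a single pinned sk-storage map could be shared by several
applications instead of each creating its own (a hypothetical
libbpf-style sketch; struct shared_val and the map name are made up):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct shared_val {
	__u64 pkt_count;
};

/* Pinned by name under bpffs, so multiple BPF applications can open the
 * same map instead of each creating their own sk local storage map,
 * keeping the number of maps per socket within the 16-slot cache.
 */
struct {
	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct shared_val);
	__uint(pinning, LIBBPF_PIN_BY_NAME);
} shared_sk_storage SEC(".maps");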

Your above inlining/outlining analysis also shows how tricky compiler
optimization can be. Without profiling, it is totally possible that the
compiler might optimize differently in the future.
Sure, but it's usually the case that we have to help the compiler a
little to produce more optimal code - if the compiler becomes stupid
in the future, we need to either fix the compiler or help it some more.

So here is my suggestion: let us do inlining
for the fast path and focus on the performance of the fast path.
The slow path (iterating the list w/o cache insertion) is still
relatively small (it's a pointer-chasing loop and a compare), and I
decided that inlining it can be justified. Martin asked in v1 why there
were slowdowns above 16 local maps; I analyzed it and concluded that
inlining most of the function is needed to fix that and does not hurt
performance: in fact, the current version is better than v1 in all cases
(even for 16 maps or below).

Let me know which version you prefer, and I'll change it. However,
based on the results, I would prefer the current version.

I compared the asm code between v1 and v2. In bpf_local_storage_update(),
bpf_local_storage_lookup() is called twice with cacheit_lockit == false.
Fully inlining bpf_local_storage_lookup() into bpf_local_storage_update()
should generate better code compared to v1.
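
In other words, at those call sites with cacheit_lockit constant false,
the inlined body effectively reduces to the lockless lookup (an
illustrative sketch of what the compiler sees after constant folding,
not the literal generated code):

	sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx],
				      bpf_rcu_lock_held());
	if (!(sdata && rcu_access_pointer(sdata->smap) == smap)) {
		hlist_for_each_entry_rcu(selem, &local_storage->list, snode,
					 rcu_read_lock_trace_held())
			if (rcu_access_pointer(SDATA(selem)->smap) == smap)
				break;
		sdata = selem ? SDATA(selem) : NULL;
	}
	/* No spinlock, and the __bpf_local_storage_insert_cache() call
	 * is elided entirely. */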

So let us do v2.

Acked-by: Yonghong Song <yonghong.song@xxxxxxxxx>

FTR, these were the results going from v1 (before) -> v2 (after):

+---- Local Storage ----------------------
|
| + num_maps: 1
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 38.593 M ops/s | 39.068 M ops/s (+1.2%)
| +- hits latency | 25.913 ns/op | 25.598 ns/op (-1.2%)
| +- important_hits throughput | 38.593 M ops/s | 39.068 M ops/s (+1.2%)
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 44.406 M ops/s | 44.926 M ops/s (+1.2%)
| +- hits latency | 22.521 ns/op | 22.259 ns/op (-1.2%)
| +- important_hits throughput | 44.406 M ops/s | 44.926 M ops/s (+1.2%)
|
| + num_maps: 10
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 37.583 M ops/s | 38.099 M ops/s (+1.4%)
| +- hits latency | 26.609 ns/op | 26.248 ns/op (-1.4%)
| +- important_hits throughput | 3.758 M ops/s | 3.810 M ops/s (+1.4%)
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 40.698 M ops/s | 41.145 M ops/s (+1.1%)
| +- hits latency | 24.573 ns/op | 24.307 ns/op (-1.1%)
| +- important_hits throughput | 14.535 M ops/s | 14.695 M ops/s (+1.1%)
|
| + num_maps: 16
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 38.061 M ops/s | 38.341 M ops/s ( ~ )
| +- hits latency | 26.275 ns/op | 26.083 ns/op ( ~ )
| +- important_hits throughput | 2.379 M ops/s | 2.396 M ops/s ( ~ )
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 40.890 M ops/s | 41.338 M ops/s (+1.1%)
| +- hits latency | 24.458 ns/op | 24.193 ns/op (-1.1%)
| +- important_hits throughput | 13.010 M ops/s | 13.153 M ops/s (+1.1%)
|
| + num_maps: 17
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 31.799 M ops/s | 32.756 M ops/s (+3.0%)
| +- hits latency | 31.448 ns/op | 30.530 ns/op (-2.9%)
| +- important_hits throughput | 1.873 M ops/s | 1.929 M ops/s (+3.0%)
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 35.284 M ops/s | 36.110 M ops/s (+2.3%)
| +- hits latency | 28.343 ns/op | 27.697 ns/op (-2.3%)
| +- important_hits throughput | 10.742 M ops/s | 10.993 M ops/s (+2.3%)
|
| + num_maps: 24
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 17.947 M ops/s | 19.937 M ops/s (+11.1%)
| +- hits latency | 55.725 ns/op | 50.166 ns/op (-10.0%)
| +- important_hits throughput | 0.748 M ops/s | 0.831 M ops/s (+11.1%)
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 21.379 M ops/s | 23.332 M ops/s (+9.1%)
| +- hits latency | 46.775 ns/op | 42.865 ns/op (-8.4%)
| +- important_hits throughput | 6.014 M ops/s | 6.564 M ops/s (+9.1%)
|
| + num_maps: 32
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 13.279 M ops/s | 14.626 M ops/s (+10.1%)
| +- hits latency | 75.317 ns/op | 68.381 ns/op (-9.2%)
| +- important_hits throughput | 0.416 M ops/s | 0.458 M ops/s (+10.2%)
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 16.444 M ops/s | 17.906 M ops/s (+8.9%)
| +- hits latency | 60.816 ns/op | 55.865 ns/op (-8.1%)
| +- important_hits throughput | 4.590 M ops/s | 4.998 M ops/s (+8.9%)
|
| + num_maps: 100
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 4.912 M ops/s | 5.528 M ops/s (+12.5%)
| +- hits latency | 207.291 ns/op | 183.059 ns/op (-11.7%)
| +- important_hits throughput | 0.049 M ops/s | 0.055 M ops/s (+12.7%)
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 6.039 M ops/s | 6.498 M ops/s (+7.6%)
| +- hits latency | 167.325 ns/op | 152.877 ns/op (-8.6%)
| +- important_hits throughput | 1.577 M ops/s | 1.697 M ops/s (+7.6%)
|
| + num_maps: 1000
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 0.342 M ops/s | 0.354 M ops/s (+3.6%)
| +- hits latency | 2930.550 ns/op | 2827.139 ns/op (-3.5%)
| +- important_hits throughput | 0.000 M ops/s | 0.000 M ops/s ( ~ )
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 0.413 M ops/s | 0.403 M ops/s (-2.5%)
| +- hits latency | 2427.830 ns/op | 2487.555 ns/op (+2.5%)
| +- important_hits throughput | 0.104 M ops/s | 0.101 M ops/s (-2.6%)
|
| Geomean:
| hits throughput: 102.93%
| hits latency: 97.11%
| important_hits throughput: 102.77%