Re: [PATCH bpf-next v2] bpf: Allow compiler to inline most of bpf_local_storage_lookup()

From: Marco Elver
Date: Thu Feb 08 2024 - 06:02:19 EST


On Thu, Feb 08, 2024 at 08:37AM +0100, Marco Elver wrote:
> On Thu, 8 Feb 2024 at 00:58, Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> > On 2/7/24 4:26 AM, Marco Elver wrote:
> > > In various performance profiles of kernels with BPF programs attached,
> > > bpf_local_storage_lookup() appears as a significant portion of CPU
> > > cycles spent. To enable the compiler generate more optimal code, turn
> > > bpf_local_storage_lookup() into a static inline function, where only the
> > > cache insertion code path is outlined
> > >
> > > Notably, outlining cache insertion helps avoid bloating callers by
> > > duplicating setting up calls to raw_spin_{lock,unlock}_irqsave() (on
> > > architectures which do not inline spin_lock/unlock, such as x86), which
> > > would cause the compiler produce worse code by deciding to outline
> > > otherwise inlinable functions. The call overhead is neutral, because we
> > > make 2 calls either way: either calling raw_spin_lock_irqsave() and
> > > raw_spin_unlock_irqsave(); or call __bpf_local_storage_insert_cache(),
> > > which calls raw_spin_lock_irqsave(), followed by a tail-call to
> > > raw_spin_unlock_irqsave() where the compiler can perform TCO and (in
> > > optimized uninstrumented builds) turns it into a plain jump. The call to
> > > __bpf_local_storage_insert_cache() can be elided entirely if
> > > cacheit_lockit is a false constant expression.
> > >
> > > Based on results from './benchs/run_bench_local_storage.sh' (21 trials,
> > > reboot between each trial; x86 defconfig + BPF, clang 16) this produces
> > > improvements in throughput and latency in the majority of cases, with an
> > > average (geomean) improvement of 8%:
> [...]
> > > include/linux/bpf_local_storage.h | 30 ++++++++++-
> > > kernel/bpf/bpf_local_storage.c | 52 +++++--------------
> > > .../bpf/prog_tests/task_local_storage.c | 6 ---
> > > .../selftests/bpf/progs/cgrp_ls_recursion.c | 26 ----------
> > > .../selftests/bpf/progs/task_ls_recursion.c | 17 ------
> > > 5 files changed, 41 insertions(+), 90 deletions(-)
> > >
> > > diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> > > index 173ec7f43ed1..dcddb0aef7d8 100644
> > > --- a/include/linux/bpf_local_storage.h
> > > +++ b/include/linux/bpf_local_storage.h
> > > @@ -129,10 +129,36 @@ bpf_local_storage_map_alloc(union bpf_attr *attr,
> > > struct bpf_local_storage_cache *cache,
> > > bool bpf_ma);
> > >
> > > -struct bpf_local_storage_data *
> > > +void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
> > > + struct bpf_local_storage_map *smap,
> > > + struct bpf_local_storage_elem *selem);
> > > +/* If cacheit_lockit is false, this lookup function is lockless */
> > > +static inline struct bpf_local_storage_data *
> > > bpf_local_storage_lookup(struct bpf_local_storage *local_storage,
> > > struct bpf_local_storage_map *smap,
> > > - bool cacheit_lockit);
> > > + bool cacheit_lockit)
> > > +{
> > > + struct bpf_local_storage_data *sdata;
> > > + struct bpf_local_storage_elem *selem;
> > > +
> > > + /* Fast path (cache hit) */
> > > + sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx],
> > > + bpf_rcu_lock_held());
> > > + if (sdata && rcu_access_pointer(sdata->smap) == smap)
> > > + return sdata;
> >
> > I think we should focus on fast path (your v1 patch)
> > and I suppose most production environments
> > want to hit fast path in most times. In your production environment did
> > you see more than 16 local storage maps per entity (task/sk/inode)?
>
> I think having more than 16 local storage maps isn't entirely unlikely
> as eBPF usage grows. But at the moment, it should be rare.
>
> > In the fast path, the memory accesses are
> > two from local_storage->cache[smap->cache_idx] and
> > one from sdata->smap
> >
> >
> > > +
> > > + /* Slow path (cache miss) */
> > > + hlist_for_each_entry_rcu(selem, &local_storage->list, snode,
> > > + rcu_read_lock_trace_held())
> > > + if (rcu_access_pointer(SDATA(selem)->smap) == smap)
> > > + break;
> >
> > But if we reach slow path here which means we have more than 16 local
> > storage maps, then traversing the list and getting SDATA(selem)->smap
> > will be very expensive, in addition to memory accesses in fast path.
> >
> > I suppose here we mostly care about socket local storage since it is
> > totally possible for a production workload to have millions of sockets.
> > To improve performance, fast path should hit in most cases.
> > If there are too many sk local storage maps, some kind of sharing
> > can be done so multiple applications might be using a single sk
> > local storage.
> >
> > Your above inlining/outlining analysis also show how tricky it is
> > for compilation optimization. Without profiling, it is totally
> > possible that compiler might do optimization differently in
> > the future.
>
> Sure, but it's usually the case that we have to help the compiler a
> little to produce more optimal code - if the compiler becomes stupid
> in future, we need either fix the compiler or help it some more.
>
> > So here is my suggestion, let us do inlining
> > for fast path and focus on performance of fast path.
>
> The slow-path (iterate list w/o cache insertion) is still relatively
> small (it's a pointer-chasing loop and a compare), and I decided that
> it can be justified inlining it. Martin asked in v1 why there were
> slowdowns above 16 local maps, and I analyzed, and concluded that
> inlining most is needed to fix and does not hurt performance: in fact,
> the current version is better than v1 in all cases (even for 16 maps
> or below).
>
> Let me know which version you prefer, and I'll change it. However,
> based on the results, I would prefer the current version.

FTR, these were the results going from v1 (before) -> v2 (after):

+---- Local Storage ----------------------
|
| + num_maps: 1
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 38.593 M ops/s | 39.068 M ops/s (+1.2%)
| +- hits latency | 25.913 ns/op | 25.598 ns/op (-1.2%)
| +- important_hits throughput | 38.593 M ops/s | 39.068 M ops/s (+1.2%)
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 44.406 M ops/s | 44.926 M ops/s (+1.2%)
| +- hits latency | 22.521 ns/op | 22.259 ns/op (-1.2%)
| +- important_hits throughput | 44.406 M ops/s | 44.926 M ops/s (+1.2%)
|
| + num_maps: 10
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 37.583 M ops/s | 38.099 M ops/s (+1.4%)
| +- hits latency | 26.609 ns/op | 26.248 ns/op (-1.4%)
| +- important_hits throughput | 3.758 M ops/s | 3.810 M ops/s (+1.4%)
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 40.698 M ops/s | 41.145 M ops/s (+1.1%)
| +- hits latency | 24.573 ns/op | 24.307 ns/op (-1.1%)
| +- important_hits throughput | 14.535 M ops/s | 14.695 M ops/s (+1.1%)
|
| + num_maps: 16
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 38.061 M ops/s | 38.341 M ops/s ( ~ )
| +- hits latency | 26.275 ns/op | 26.083 ns/op ( ~ )
| +- important_hits throughput | 2.379 M ops/s | 2.396 M ops/s ( ~ )
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 40.890 M ops/s | 41.338 M ops/s (+1.1%)
| +- hits latency | 24.458 ns/op | 24.193 ns/op (-1.1%)
| +- important_hits throughput | 13.010 M ops/s | 13.153 M ops/s (+1.1%)
|
| + num_maps: 17
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 31.799 M ops/s | 32.756 M ops/s (+3.0%)
| +- hits latency | 31.448 ns/op | 30.530 ns/op (-2.9%)
| +- important_hits throughput | 1.873 M ops/s | 1.929 M ops/s (+3.0%)
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 35.284 M ops/s | 36.110 M ops/s (+2.3%)
| +- hits latency | 28.343 ns/op | 27.697 ns/op (-2.3%)
| +- important_hits throughput | 10.742 M ops/s | 10.993 M ops/s (+2.3%)
|
| + num_maps: 24
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 17.947 M ops/s | 19.937 M ops/s (+11.1%)
| +- hits latency | 55.725 ns/op | 50.166 ns/op (-10.0%)
| +- important_hits throughput | 0.748 M ops/s | 0.831 M ops/s (+11.1%)
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 21.379 M ops/s | 23.332 M ops/s (+9.1%)
| +- hits latency | 46.775 ns/op | 42.865 ns/op (-8.4%)
| +- important_hits throughput | 6.014 M ops/s | 6.564 M ops/s (+9.1%)
|
| + num_maps: 32
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 13.279 M ops/s | 14.626 M ops/s (+10.1%)
| +- hits latency | 75.317 ns/op | 68.381 ns/op (-9.2%)
| +- important_hits throughput | 0.416 M ops/s | 0.458 M ops/s (+10.2%)
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 16.444 M ops/s | 17.906 M ops/s (+8.9%)
| +- hits latency | 60.816 ns/op | 55.865 ns/op (-8.1%)
| +- important_hits throughput | 4.590 M ops/s | 4.998 M ops/s (+8.9%)
|
| + num_maps: 100
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 4.912 M ops/s | 5.528 M ops/s (+12.5%)
| +- hits latency | 207.291 ns/op | 183.059 ns/op (-11.7%)
| +- important_hits throughput | 0.049 M ops/s | 0.055 M ops/s (+12.7%)
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 6.039 M ops/s | 6.498 M ops/s (+7.6%)
| +- hits latency | 167.325 ns/op | 152.877 ns/op (-8.6%)
| +- important_hits throughput | 1.577 M ops/s | 1.697 M ops/s (+7.6%)
|
| + num_maps: 1000
| : <before> | <after>
| +-+ local_storage cache sequential get +----------------------+----------------------
| +- hits throughput | 0.342 M ops/s | 0.354 M ops/s (+3.6%)
| +- hits latency | 2930.550 ns/op | 2827.139 ns/op (-3.5%)
| +- important_hits throughput | 0.000 M ops/s | 0.000 M ops/s ( ~ )
| :
| : <before> | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput | 0.413 M ops/s | 0.403 M ops/s (-2.5%)
| +- hits latency | 2427.830 ns/op | 2487.555 ns/op (+2.5%)
| +- important_hits throughput | 0.104 M ops/s | 0.101 M ops/s (-2.6%)
|
| Geomean:
| hits throughput: 102.93%
| hits latency: 97.11%
| important_hits throughput: 102.77%