> In various performance profiles of kernels with BPF programs attached,
> bpf_local_storage_lookup() appears as a significant portion of CPU
> cycles spent. To let the compiler generate more optimal code, turn
> bpf_local_storage_lookup() into a static inline function, where only the
> cache insertion code path is outlined.
> Notably, outlining cache insertion helps avoid bloating callers by
> duplicating the setup of calls to raw_spin_lock_irqsave() and
> raw_spin_unlock_irqrestore() (on
> architectures which do not inline spin_lock/unlock, such as x86), which
> would cause the compiler to produce worse code by deciding to outline
> otherwise inlinable functions. The call overhead is neutral, because we
> make 2 calls either way: either calling raw_spin_lock_irqsave() and
> raw_spin_unlock_irqrestore(); or calling __bpf_local_storage_insert_cache(),
> which calls raw_spin_lock_irqsave(), followed by a tail-call to
> raw_spin_unlock_irqrestore() where the compiler can perform TCO and (in
> optimized uninstrumented builds) turn it into a plain jump. The call to
> __bpf_local_storage_insert_cache() can be elided entirely if
> cacheit_lockit is a false constant expression.
> [...]
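
For readers following along, the inline-fast-path / outlined-slow-path
split described above can be sketched roughly as follows. This is a
userspace illustration only, not the kernel code: struct storage,
storage_lookup(), storage_insert_cache() and CACHE_SLOTS are hypothetical
names, a pthread mutex stands in for the raw spinlock, and the noinline
attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SLOTS 4	/* hypothetical cache size, for illustration only */

struct item {
	int key;
	int value;
};

struct storage {
	pthread_mutex_t lock;			/* stand-in for the raw spinlock */
	struct item *cache[CACHE_SLOTS];
	struct item items[8];
	size_t nr_items;
};

/*
 * Outlined slow path: the lock/unlock call setup lives here exactly once
 * instead of being duplicated into every inlined caller. The unlock is
 * the last call in the function, so an optimizing compiler may emit it
 * as a tail call.
 */
__attribute__((noinline))
void storage_insert_cache(struct storage *s, struct item *it, size_t idx)
{
	pthread_mutex_lock(&s->lock);
	s->cache[idx] = it;
	pthread_mutex_unlock(&s->lock);
}

/*
 * Inlined lookup. A cache hit returns without taking the lock or making
 * any call. (This sketch reads the cache unlocked and is therefore only
 * safe single-threaded; it does not model the kernel's RCU protection.)
 */
static inline struct item *
storage_lookup(struct storage *s, int key, size_t idx, bool cacheit)
{
	struct item *it;
	size_t i;

	assert(idx < CACHE_SLOTS);

	it = s->cache[idx];
	if (it && it->key == key)
		return it;

	/* Cache miss: fall back to a linear scan of all items. */
	for (i = 0; i < s->nr_items; i++)
		if (s->items[i].key == key)
			break;
	if (i == s->nr_items)
		return NULL;
	it = &s->items[i];

	/*
	 * If the caller passes a constant false, this branch and the call
	 * can be dropped entirely once the function is inlined.
	 */
	if (cacheit)
		storage_insert_cache(s, it, idx);
	return it;
}
```

A call site that passes a literal false for cacheit should compile down
to just the cache check and the scan, whereas a call site passing true
pays one out-of-line call on a miss and nothing extra on a hit.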

