Re: [RFC 00/12] locking: Separate lock tracepoints from lockdep/lock_stat (v1)

From: Namhyung Kim
Date: Tue Feb 08 2022 - 14:14:25 EST


Oops, I used the wrong email address of Paul. Sorry about that!
I'll resend with a new address soon.

Thanks,
Namhyung


On Tue, Feb 8, 2022 at 10:42 AM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> Hello,
>
> There have been some requests for low-overhead kernel lock contention
> monitoring. The kernel has CONFIG_LOCK_STAT to provide such an infra
> either via /proc/lock_stat or tracepoints directly.
>
> However it's not light-weight and hard to be used in production. So
> I'm trying to separate out the tracepoints and using them as a base to
> build a new monitoring system.
>
> As the lockdep and lock_stat provide good hooks in the lock functions,
> it'd be natural to reuse them. Actually I tried to use lockdep as is
> but disables the functionality at runtime (initialize debug_locks = 0,
> lock_stat = 0). But it still has unacceptable overhead and the
> lockdep data structures also increase memory footprint unnecessarily.
>
> So I'm proposing a separate tracepoint-only configuration and keeping
> lockdep_map only with minimal information needed for tracepoints (for
> now, it's just name). And then the existing lockdep hooks can be used
> for the tracepoints.
>
> The patch 01-06 are preparation for the work. In a few places in the
> kernel, they calls lockdep annotation explicitly to deal with
> limitations in the lockdep implementation. In my understanding, they
> are not needed to analyze lock contention.
>
> To make matters worse, they rely on the compiler optimization (or
> macro magic) so that it can get rid of the annotations and their
> arguments when lockdep is not configured.
>
> But it's not true any more when lock tracepoints are added and it'd
> cause build errors. So I added #ifdef guards for LOCKDEP in the code
> to prevent such errors.
>
> In the patch 07 I mechanically changed most of code that depend on
> CONFIG_LOCKDEP or CONFIG_DEBUG_LOCK_ALLOC to CONFIG_LOCK_INFO. It
> paves the way for the codes to be shared for lockdep and tracepoints.
> Mostly, it makes sure that locks are initialized with a proper name,
> like in the patch 08 and 09.
>
> I didn't change some places intentionally - for example, timer and
> workqueue depend on LOCKDEP explicitly since they use some lockdep
> annotations to work around the "held lock freed" warnings. The ocfs2
> directly accesses lockdep_map.key so I didn't touch the code for now.
> And RCU was because it generates too much overhead thanks to the
> rcu_read_lock(). Maybe I need to revisit some of them later.
>
> I added CONFIG_LOCK_TRACEPOINTS in the patch 10 to make it optional.
> I found that it adds 2~3% of overhead when I ran `perf bench sched
> messaging` even when the tracepoints are disabled. The benchmark
> creates a lot of processes and make them communicate by socket pairs
> (or pipes). I measured that around 15% of lock acquisition creates
> contentions and it's mostly for spin locks (alc->lock and u->lock).
>
> I ran perf record + report with the workload and it showed 50% of the
> cpu cycles are in the spin lock slow path. So it's highly affected by
> the spin lock slow path. Actually LOCK_CONTENDED() macro transforms
> the spin lock code (and others) to use trylock first and then fall
> back to real lock function if failed. Thus it'd add more (atomic)
> operations and cache line bouncing for the contended cases:
>
> #define LOCK_CONTENDED(_lock, try, lock) \
> do { \
> if (!try(_lock)) { \
> lock_contended(&(_lock)->dep_map, _RET_IP_); \
> lock(_lock); \
> } \
> lock_acquired(&(_lock)->dep_map, _RET_IP_); \
> } while (0)
>
> If I modify the macro not to use trylock and to call the real lock
> function directly (so the lock_contended tracepoint would be called
> always, if enabled), the overhead goes down to 0.x% when the
> tracepoints are disabled.
>
> I don't have a good solution as long as we use LOCK_CONTENDED() macro
> to separate the contended locking path. Maybe we can make it call
> some (generic) slow path lock function directly after failing trylock.
> Or move the lockdep annotations into the each lock function bodies and
> get rid of the LOCK_CONTENDED() macro entirely. Ideas?
>
> Actually the patch 11 handles the same issue on the mutex code. The
> fast version of mutex trylock was attempted only if LOCKDEP is not
> enabled and it affects the mutex lock performance in the uncontended
> cases too. So I partially reverted the change in the patch 07 to use
> the fast functions with lock tracepoints too. Maybe we can use it
> with LOCKDEP as well?
>
> The last patch 12 might be controversial and I'd like to move the
> lock_acquired annotation into the if(!try) block in the LOCK_CONTEDED
> macro so that it can only be called when there's a contention.
>
> Eventually I'm mostly interested in the contended locks only and I
> want to reduce the overhead in the fast path. By moving that, it'd be
> easy to track contended locks with timing by using two tracepoints.
>
> It'd change lock hold time calculation in lock_stat for the fast path,
> but I assume time difference between lock_acquire and lock_acquired
> would be small when the lock is not contended. So I think we can use
> the timestamp in lock_acquire. If it's not acceptable, we might
> consider adding a new tracepoint to track the timing of contended
> locks.
>
> This series base on the current tip/locking/core and you get it from
> 'locking/tracepoint-v1' branch in my tree at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
>
>
> Thanks,
> Namhyung
>
>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Byungchul Park <byungchul.park@xxxxxxx>
> Cc: rcu@xxxxxxxxxxxxxxx
> Cc: cgroups@xxxxxxxxxxxxxxx
> Cc: linux-btrfs@xxxxxxxxxxxxxxx
> Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx
>
>
> Namhyung Kim (12):
> locking: Pass correct outer wait type info
> cgroup: rstat: Make cgroup_rstat_cpu_lock name readable
> timer: Protect lockdep functions with #ifdef
> workqueue: Protect lockdep functions with #ifdef
> drm/i915: Protect lockdep functions with #ifdef
> btrfs: change lockdep class size check using ks->names
> locking: Introduce CONFIG_LOCK_INFO
> locking/mutex: Init name properly w/ CONFIG_LOCK_INFO
> locking: Add more static lockdep init macros
> locking: Add CONFIG_LOCK_TRACEPOINTS option
> locking/mutex: Revive fast functions for LOCK_TRACEPOINTS
> locking: Move lock_acquired() from the fast path
>
> drivers/gpu/drm/drm_connector.c | 7 +-
> drivers/gpu/drm/i915/i915_sw_fence.h | 2 +-
> drivers/gpu/drm/i915/intel_wakeref.c | 3 +
> drivers/gpu/drm/i915/selftests/lib_sw_fence.h | 2 +-
> .../net/wireless/intel/iwlwifi/iwl-trans.c | 4 +-
> .../net/wireless/intel/iwlwifi/iwl-trans.h | 2 +-
> drivers/tty/tty_ldsem.c | 2 +-
> fs/btrfs/disk-io.c | 4 +-
> fs/btrfs/disk-io.h | 2 +-
> fs/cifs/connect.c | 2 +-
> fs/kernfs/file.c | 2 +-
> include/linux/completion.h | 2 +-
> include/linux/jbd2.h | 2 +-
> include/linux/kernfs.h | 2 +-
> include/linux/kthread.h | 2 +-
> include/linux/local_lock_internal.h | 18 +-
> include/linux/lockdep.h | 170 ++++++++++++++++--
> include/linux/lockdep_types.h | 8 +-
> include/linux/mmu_notifier.h | 2 +-
> include/linux/mutex.h | 12 +-
> include/linux/percpu-rwsem.h | 4 +-
> include/linux/regmap.h | 4 +-
> include/linux/rtmutex.h | 14 +-
> include/linux/rwlock_api_smp.h | 4 +-
> include/linux/rwlock_rt.h | 4 +-
> include/linux/rwlock_types.h | 11 +-
> include/linux/rwsem.h | 14 +-
> include/linux/seqlock.h | 4 +-
> include/linux/spinlock_api_smp.h | 4 +-
> include/linux/spinlock_rt.h | 4 +-
> include/linux/spinlock_types.h | 4 +-
> include/linux/spinlock_types_raw.h | 28 ++-
> include/linux/swait.h | 2 +-
> include/linux/tty_ldisc.h | 2 +-
> include/linux/wait.h | 2 +-
> include/linux/ww_mutex.h | 6 +-
> include/media/v4l2-ctrls.h | 2 +-
> include/net/sock.h | 2 +-
> include/trace/events/lock.h | 4 +-
> kernel/cgroup/rstat.c | 7 +-
> kernel/locking/Makefile | 1 +
> kernel/locking/lockdep.c | 40 ++++-
> kernel/locking/mutex-debug.c | 2 +-
> kernel/locking/mutex.c | 22 ++-
> kernel/locking/mutex.h | 7 +
> kernel/locking/percpu-rwsem.c | 2 +-
> kernel/locking/rtmutex_api.c | 10 +-
> kernel/locking/rwsem.c | 4 +-
> kernel/locking/spinlock.c | 2 +-
> kernel/locking/spinlock_debug.c | 4 +-
> kernel/locking/spinlock_rt.c | 8 +-
> kernel/locking/ww_rt_mutex.c | 2 +-
> kernel/printk/printk.c | 14 +-
> kernel/rcu/update.c | 27 +--
> kernel/time/timer.c | 8 +-
> kernel/workqueue.c | 13 ++
> lib/Kconfig.debug | 14 ++
> mm/memcontrol.c | 7 +-
> mm/mmu_notifier.c | 2 +-
> net/core/dev.c | 2 +-
> net/sunrpc/svcsock.c | 2 +-
> net/sunrpc/xprtsock.c | 2 +-
> 62 files changed, 391 insertions(+), 180 deletions(-)
>
>
> base-commit: 1dc01abad6544cb9d884071b626b706e37aa9601
> --
> 2.35.0.263.gb82422642f-goog
>