[PATCH 6/8] perf/hw_breakpoint: Reduce contention with large number of tasks

From: Marco Elver
Date: Thu Jun 09 2022 - 07:32:00 EST

Next message: Marco Elver: "[PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent"
Previous message: Marco Elver: "[PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints"
In reply to: Marco Elver: "Re: [PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints"
Next in thread: Dmitry Vyukov: "Re: [PATCH 6/8] perf/hw_breakpoint: Reduce contention with large number of tasks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

While optimizing task_bp_pinned()'s runtime complexity to O(1) on
average helps reduce time spent in the critical section, we still suffer
due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows
that now contention is the biggest issue:

95.93% [kernel] [k] osq_lock
0.70% [kernel] [k] mutex_spin_on_owner
0.22% [kernel] [k] smp_cfm_core_cond
0.18% [kernel] [k] task_bp_pinned
0.18% [kernel] [k] rhashtable_jhash2
0.15% [kernel] [k] queued_spin_lock_slowpath

when running the breakpoint benchmark with (system with 256 CPUs):

| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.207 [sec]
|
| 108.267188 usecs/op
| 6929.100000 usecs/op/cpu

The main concern for synchronizing the breakpoint constraints data is
that a consistent snapshot of the per-CPU and per-task data is observed.

The access pattern is as follows:

1. If the target is a task: the task's pinned breakpoints are counted,
checked for space, and then appended to; only bp_cpuinfo::cpu_pinned
is used to check for conflicts with CPU-only breakpoints;
bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise
unused.

2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along
with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is
incremented. No per-task breakpoints are checked.

Since rhltable safely synchronizes insertions/deletions, we can allow
concurrency as follows:

1. If the target is a task: independent tasks may update and check the
constraints concurrently, but same-task target calls need to be
serialized; since bp_cpuinfo::tsk_pinned is only updated, but not
checked, these modifications can happen concurrently by switching
tsk_pinned to atomic_t.

2. If the target is a CPU: access to the per-CPU constraints needs to
be serialized with other CPU-target and task-target callers (to
stabilize the bp_cpuinfo::tsk_pinned snapshot).

We can allow the above concurrency by introducing a per-CPU constraints
data reader-writer lock (bp_cpuinfo_lock), and per-task mutexes
(task_sharded_mtx):

1. If the target is a task: acquires its task_sharded_mtx, and
acquires bp_cpuinfo_lock as a reader.

2. If the target is a CPU: acquires bp_cpuinfo_lock as a writer.

With these changes, contention with thousands of tasks is reduced to the
point where waiting on locking no longer dominates the profile:

| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.080 [sec]
|
| 42.048437 usecs/op
| 2691.100000 usecs/op/cpu

21.31% [kernel] [k] task_bp_pinned
17.49% [kernel] [k] rhashtable_jhash2
5.29% [kernel] [k] toggle_bp_slot
4.45% [kernel] [k] mutex_spin_on_owner
3.72% [kernel] [k] bcmp

On this particular setup that's a speedup of 2.5x.

We're also getting closer to the theoretical ideal performance through
optimizations in hw_breakpoint.c -- constraints accounting disabled:

| perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.067 [sec]
|
| 35.286458 usecs/op
| 2258.333333 usecs/op/cpu

Which means the current implementation is ~19% slower than the
theoretical ideal.

For reference, performance without any breakpoints:

| $> bench -r 30 breakpoint thread -b 0 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 0 breakpoints and 64 parallelism
| Total time: 0.060 [sec]
|
| 31.365625 usecs/op
| 2007.400000 usecs/op/cpu

The theoretical ideal is only ~12% slower than no breakpoints at all.
The current implementation is ~34% slower than no breakpoints at all.
(On a system with 256 CPUs.)

Signed-off-by: Marco Elver <elver@xxxxxxxxxx>
---
kernel/events/hw_breakpoint.c | 155 ++++++++++++++++++++++++++++------
1 file changed, 128 insertions(+), 27 deletions(-)

diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index afe0a6007e96..08c9ed0626e4 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -17,6 +17,7 @@
* This file contains the arch-independent routines.
*/

+#include <linux/atomic.h>
#include <linux/irqflags.h>
#include <linux/kallsyms.h>
#include <linux/notifier.h>
@@ -24,8 +25,10 @@
#include <linux/kdebug.h>
#include <linux/kernel.h>
#include <linux/module.h>
+#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/sched.h>
+#include <linux/spinlock.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/rhashtable.h>
@@ -42,9 +45,9 @@ struct bp_cpuinfo {
unsigned int cpu_pinned;
/* tsk_pinned[n] is the number of tasks having n+1 breakpoints */
#ifdef hw_breakpoint_slots
- unsigned int tsk_pinned[hw_breakpoint_slots(0)];
+ atomic_t tsk_pinned[hw_breakpoint_slots(0)];
#else
- unsigned int *tsk_pinned;
+ atomic_t *tsk_pinned;
#endif
};

@@ -71,8 +74,81 @@ struct bp_busy_slots {
unsigned int pinned;
};

-/* Serialize accesses to the above constraints */
-static DEFINE_MUTEX(nr_bp_mutex);
+/*
+ * Synchronizes accesses to the per-CPU constraints; users of data in bp_cpuinfo
+ * must acquire bp_cpuinfo_lock as writer to get a stable snapshot of all CPUs'
+ * constraints. Modifications without use may only acquire bp_cpuinfo_lock as a
+ * reader, but must otherwise ensure modifications are never lost.
+ */
+static DEFINE_RWLOCK(bp_cpuinfo_lock);
+
+/*
+ * Synchronizes accesses to the per-task breakpoint list in task_bps_ht. Since
+ * rhltable synchronizes concurrent insertions/deletions, independent tasks may
+ * insert/delete concurrently; therefore, a mutex per task would be sufficient.
+ *
+ * To avoid bloating task_struct with infrequently used data, use a sharded
+ * mutex that scales with number of CPUs.
+ */
+static DEFINE_PER_CPU(struct mutex, task_sharded_mtx);
+
+static struct mutex *get_task_sharded_mtx(struct perf_event *bp)
+{
+ int shard;
+
+ if (!bp->hw.target)
+ return NULL;
+
+ /*
+ * Compute a valid shard index into per-CPU data.
+ */
+ shard = task_pid_nr(bp->hw.target) % nr_cpu_ids;
+ shard = cpumask_next(shard - 1, cpu_possible_mask);
+ if (shard >= nr_cpu_ids)
+ shard = cpumask_first(cpu_possible_mask);
+
+ return per_cpu_ptr(&task_sharded_mtx, shard);
+}
+
+static struct mutex *bp_constraints_lock(struct perf_event *bp)
+{
+ struct mutex *mtx = get_task_sharded_mtx(bp);
+
+ if (mtx) {
+ mutex_lock(mtx);
+ read_lock(&bp_cpuinfo_lock);
+ } else {
+ write_lock(&bp_cpuinfo_lock);
+ }
+
+ return mtx;
+}
+
+static void bp_constraints_unlock(struct mutex *mtx)
+{
+ if (mtx) {
+ read_unlock(&bp_cpuinfo_lock);
+ mutex_unlock(mtx);
+ } else {
+ write_unlock(&bp_cpuinfo_lock);
+ }
+}
+
+static bool bp_constraints_is_locked(struct perf_event *bp)
+{
+ struct mutex *mtx = get_task_sharded_mtx(bp);
+
+ return (mtx ? mutex_is_locked(mtx) : false) ||
+ rwlock_is_contended(&bp_cpuinfo_lock);
+}
+
+static inline void assert_bp_constraints_lock_held(struct perf_event *bp)
+{
+ lockdep_assert_held(&bp_cpuinfo_lock);
+ /* Don't call get_task_sharded_mtx() if lockdep is disabled. */
+ if (IS_ENABLED(CONFIG_LOCKDEP) && bp->hw.target)
+ lockdep_assert_held(get_task_sharded_mtx(bp));
+}

#ifdef hw_breakpoint_slots
/*
@@ -103,7 +179,7 @@ static __init int init_breakpoint_slots(void)
for (i = 0; i < TYPE_MAX; i++) {
struct bp_cpuinfo *info = get_bp_info(cpu, i);

- info->tsk_pinned = kcalloc(__nr_bp_slots[i], sizeof(int), GFP_KERNEL);
+ info->tsk_pinned = kcalloc(__nr_bp_slots[i], sizeof(atomic_t), GFP_KERNEL);
if (!info->tsk_pinned)
goto err;
}
@@ -143,11 +219,19 @@ static inline enum bp_type_idx find_slot_idx(u64 bp_type)
*/
static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
{
- unsigned int *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
+ atomic_t *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
int i;

+ /*
+ * At this point we want to have acquired the bp_cpuinfo_lock as a
+ * writer to ensure that there are no concurrent writers in
+ * toggle_bp_task_slot() to tsk_pinned, and we get a stable snapshot.
+ */
+ lockdep_assert_held_write(&bp_cpuinfo_lock);
+
for (i = hw_breakpoint_slots_cached(type) - 1; i >= 0; i--) {
- if (tsk_pinned[i] > 0)
+ ASSERT_EXCLUSIVE_WRITER(tsk_pinned[i]); /* Catch unexpected writers. */
+ if (atomic_read(&tsk_pinned[i]) > 0)
return i + 1;
}

@@ -164,6 +248,11 @@ static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
struct perf_event *iter;
int count = 0;

+ /*
+ * We need a stable snapshot of the per-task breakpoint list.
+ */
+ assert_bp_constraints_lock_held(bp);
+
rcu_read_lock();
head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
if (!head)
@@ -230,16 +319,25 @@ fetch_this_slot(struct bp_busy_slots *slots, int weight)
static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
enum bp_type_idx type, int weight)
{
- unsigned int *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
+ atomic_t *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
int old_idx, new_idx;

+ /*
+ * If bp->hw.target, tsk_pinned is only modified, but not used
+ * otherwise. We can permit concurrent updates as long as there are no
+ * other uses: having acquired bp_cpuinfo_lock as a reader allows
+ * concurrent updates here. Uses of tsk_pinned will require acquiring
+ * bp_cpuinfo_lock as a writer to stabilize tsk_pinned's value.
+ */
+ lockdep_assert_held_read(&bp_cpuinfo_lock);
+
old_idx = task_bp_pinned(cpu, bp, type) - 1;
new_idx = old_idx + weight;

if (old_idx >= 0)
- tsk_pinned[old_idx]--;
+ atomic_dec(&tsk_pinned[old_idx]);
if (new_idx >= 0)
- tsk_pinned[new_idx]++;
+ atomic_inc(&tsk_pinned[new_idx]);
}

/*
@@ -257,6 +355,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,

/* Pinned counter cpu profiling */
if (!bp->hw.target) {
+ lockdep_assert_held_write(&bp_cpuinfo_lock);
get_bp_info(bp->cpu, type)->cpu_pinned += weight;
return 0;
}
@@ -265,6 +364,11 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
for_each_cpu(cpu, cpumask)
toggle_bp_task_slot(bp, cpu, type, weight);

+ /*
+ * Readers want a stable snapshot of the per-task breakpoint list.
+ */
+ assert_bp_constraints_lock_held(bp);
+
if (enable)
return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
else
@@ -372,14 +476,10 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)

int reserve_bp_slot(struct perf_event *bp)
{
- int ret;
-
- mutex_lock(&nr_bp_mutex);
-
- ret = __reserve_bp_slot(bp, bp->attr.bp_type);
-
- mutex_unlock(&nr_bp_mutex);
+ struct mutex *mtx = bp_constraints_lock(bp);
+ int ret = __reserve_bp_slot(bp, bp->attr.bp_type);

+ bp_constraints_unlock(mtx);
return ret;
}

@@ -397,12 +497,11 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)

void release_bp_slot(struct perf_event *bp)
{
- mutex_lock(&nr_bp_mutex);
+ struct mutex *mtx = bp_constraints_lock(bp);

arch_unregister_hw_breakpoint(bp);
__release_bp_slot(bp, bp->attr.bp_type);
-
- mutex_unlock(&nr_bp_mutex);
+ bp_constraints_unlock(mtx);
}

static int __modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_type)
@@ -429,11 +528,10 @@ static int __modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_type)

static int modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_type)
{
- int ret;
+ struct mutex *mtx = bp_constraints_lock(bp);
+ int ret = __modify_bp_slot(bp, old_type, new_type);

- mutex_lock(&nr_bp_mutex);
- ret = __modify_bp_slot(bp, old_type, new_type);
- mutex_unlock(&nr_bp_mutex);
+ bp_constraints_unlock(mtx);
return ret;
}

@@ -444,7 +542,7 @@ static int modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_type)
*/
int dbg_reserve_bp_slot(struct perf_event *bp)
{
- if (mutex_is_locked(&nr_bp_mutex))
+ if (bp_constraints_is_locked(bp))
return -1;

return __reserve_bp_slot(bp, bp->attr.bp_type);
@@ -452,7 +550,7 @@ int dbg_reserve_bp_slot(struct perf_event *bp)

int dbg_release_bp_slot(struct perf_event *bp)
{
- if (mutex_is_locked(&nr_bp_mutex))
+ if (bp_constraints_is_locked(bp))
return -1;

__release_bp_slot(bp, bp->attr.bp_type);
@@ -735,7 +833,10 @@ static struct pmu perf_breakpoint = {

int __init init_hw_breakpoint(void)
{
- int ret;
+ int cpu, ret;
+
+ for_each_possible_cpu(cpu)
+ mutex_init(&per_cpu(task_sharded_mtx, cpu));

ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
if (ret)
--
2.36.1.255.ge46751e96f-goog

Next message: Marco Elver: "[PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent"
Previous message: Marco Elver: "[PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints"
In reply to: Marco Elver: "Re: [PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints"
Next in thread: Dmitry Vyukov: "Re: [PATCH 6/8] perf/hw_breakpoint: Reduce contention with large number of tasks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]