Re: [PATCH v3 2/2] mmap_lock: add tracepoints around lock acquisition

From: Vlastimil Babka
Date: Tue Oct 20 2020 - 10:50:34 EST


On 10/10/20 12:05 AM, Axel Rasmussen wrote:
The goal of these tracepoints is to be able to debug lock contention
issues. This lock is acquired on most (all?) mmap / munmap / page fault
operations, so a multi-threaded process which does a lot of these can
experience significant contention.

We trace just before we start acquisition, when the acquisition returns
(whether it succeeded or not), and when the lock is released (or
downgraded). The events are broken out by lock type (read / write).

The events are also broken out by memcg path. For container-based
workloads, users often think of several processes in a memcg as a single
logical "task", so collecting statistics at this level is useful.

The end goal is to get latency information. This isn't directly included
in the trace events. Instead, users are expected to compute the time
between "start locking" and "acquire returned", using e.g. synthetic
events or BPF. The benefit we get from this is simpler code.

Because we use tracepoint_enabled() to decide whether or not to trace,
this patch has effectively no overhead unless tracepoints are enabled at
runtime. If tracepoints are enabled, there is a performance impact, but
how much depends on exactly what e.g. the BPF program does.

Signed-off-by: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>

Yeah, I agree with this approach, which follows the page_ref tracepoints.
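For illustration, the latency computation the changelog leaves to the consumer (time between "start locking" and "acquire returned") could look roughly like the userspace sketch below. This is not part of the patch: the lock_event record, the MAX_TIDS limit and the acquire_latencies() helper are all hypothetical, and it assumes events arrive sorted by timestamp and that a thread has at most one acquisition in flight at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified record of one mmap_lock trace event. */
enum ev_kind { EV_START_LOCKING, EV_ACQUIRE_RETURNED, EV_RELEASED };

struct lock_event {
	uint64_t ts_ns;		/* event timestamp, nanoseconds */
	int tid;		/* thread that emitted the event */
	enum ev_kind kind;
	bool write;		/* read or write lock */
	bool success;		/* only meaningful for EV_ACQUIRE_RETURNED */
};

#define MAX_TIDS 64		/* assumption: small, dense tid space */

/*
 * Pair each "acquire returned" event with the preceding "start locking"
 * event from the same thread and store the elapsed time in out[].
 * Failed acquisitions are counted too, since the wait still happened.
 * Returns the number of latencies written (at most out_cap).
 */
static size_t acquire_latencies(const struct lock_event *ev, size_t n,
				uint64_t *out, size_t out_cap)
{
	uint64_t start_ts[MAX_TIDS] = { 0 };
	bool pending[MAX_TIDS] = { false };
	size_t written = 0;

	for (size_t i = 0; i < n; i++) {
		int tid = ev[i].tid;

		assert(tid >= 0 && tid < MAX_TIDS);
		if (ev[i].kind == EV_START_LOCKING) {
			start_ts[tid] = ev[i].ts_ns;
			pending[tid] = true;
		} else if (ev[i].kind == EV_ACQUIRE_RETURNED && pending[tid]) {
			pending[tid] = false;
			if (written < out_cap)
				out[written++] = ev[i].ts_ns - start_ts[tid];
		}
		/* EV_RELEASED does not affect acquisition latency. */
	}
	return written;
}
```

In practice the same pairing would be done with a synthetic event or a BPF map keyed by thread, as the changelog suggests; the point is only that the two events carry enough information to do it.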

...

diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
new file mode 100644
index 000000000000..b849287bd12a
--- /dev/null
+++ b/mm/mmap_lock.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+#define CREATE_TRACE_POINTS
+#include <trace/events/mmap_lock.h>
+
+#include <linux/mm.h>
+#include <linux/cgroup.h>
+#include <linux/memcontrol.h>
+#include <linux/mmap_lock.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
+#include <linux/trace_events.h>
+
+/*
+ * We have to export these, as drivers use mmap_lock, and our inline functions
+ * in the header check if the tracepoint is enabled. They can't be GPL, as e.g.
+ * the nvidia driver is an existing caller of this code.

I don't think this argument works in the kernel community. I would just remove this comment.

+ */
+EXPORT_SYMBOL(__tracepoint_mmap_lock_start_locking);
+EXPORT_SYMBOL(__tracepoint_mmap_lock_acquire_returned);
+EXPORT_SYMBOL(__tracepoint_mmap_lock_released);

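You can use EXPORT_TRACEPOINT_SYMBOL() here, e.g. EXPORT_TRACEPOINT_SYMBOL(mmap_lock_start_locking); it takes the tracepoint name without the __tracepoint_ prefix.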

+#ifdef CONFIG_MEMCG
+
+DEFINE_PER_CPU(char[MAX_FILTER_STR_VAL], trace_memcg_path);
+
+/*
+ * Write the given mm_struct's memcg path to a percpu buffer, and return a
+ * pointer to it. If the path cannot be determined, the buffer will contain the
+ * empty string.
+ *
+ * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
+ * disabled by the caller before calling us, and re-enabled only after the
+ * caller is done with the pointer.
+ */
+static const char *get_mm_memcg_path(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+
+	if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
+		char *buf = this_cpu_ptr(trace_memcg_path);
+
+		cgroup_path(memcg->css.cgroup, buf, MAX_FILTER_STR_VAL);
+		return buf;
+	}
+	return "";
+}
+
+#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \
+	do { \
+		if (trace_mmap_lock_##type##_enabled()) { \

Is this check really needed? We only get called from the functions inlined in the .h file because tracepoint_enabled() was true in the first place, so this seems redundant.

+			get_cpu(); \
+			trace_mmap_lock_##type(mm, get_mm_memcg_path(mm), \
+					       ##__VA_ARGS__); \
+			put_cpu(); \
+		} \
+	} while (0)
+
+#else /* !CONFIG_MEMCG */
+
+#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \
+	trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
+
+#endif /* CONFIG_MEMCG */
+
+/*
+ * Trace calls must be in a separate file, as otherwise there's a circular
+ * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
+ */
+
+void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
+{
+	TRACE_MMAP_LOCK_EVENT(start_locking, mm, write, true);

Seems wasteful to have an always-true success field here. Yeah, not reusing the same event class for all three tracepoints means more code, but for tracing efficiency it's worth it, IMHO.

+}
+EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
+
+void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
+					   bool success)
+{
+	TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
+}
+EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
+
+void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
+{
+	TRACE_MMAP_LOCK_EVENT(released, mm, write, true);

Ditto.

+}
+EXPORT_SYMBOL(__mmap_lock_do_trace_released);