Re: [PATCH] uprobes: reduce contention on uprobes_tree access

From: Andrii Nakryiko
Date: Tue Mar 26 2024 - 12:02:12 EST


On Sun, Mar 24, 2024 at 8:03 PM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
>
> On Thu, 21 Mar 2024 07:57:35 -0700
> Jonathan Haslam <jonathan.haslam@xxxxxxxxx> wrote:
>
> > Active uprobes are stored in an RB tree and accesses to this tree are
> > dominated by read operations. Currently these accesses are serialized by
> > a spinlock but this leads to enormous contention when large numbers of
> > threads are executing active probes.
> >
> > This patch converts the spinlock used to serialize access to the
> > uprobes_tree RB tree into a reader-writer spinlock. This lock type
> > aligns naturally with the overwhelmingly read-only nature of the tree
> > usage here. Although the addition of reader-writer spinlocks are
> > discouraged [0], this fix is proposed as an interim solution while an
> > RCU based approach is implemented (that work is in a nascent form). This
> > fix also has the benefit of being trivial, self contained and therefore
> > simple to backport.
> >
> > This change has been tested against production workloads that exhibit
> > significant contention on the spinlock and an almost order of magnitude
> > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
>
> Looks good to me.
>
> Acked-by: Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>

Masami,

Given the discussion around per-cpu rw semaphore and need for
(internal) batched attachment API for uprobes, do you think you can
apply this patch as is for now? We can then gain initial improvements
in scalability that are also easy to backport, and Jonathan will work
on a more complete solution based on per-cpu RW semaphore, as
suggested by Ingo.

>
> BTW, how did you measure the overhead? I think spinlock overhead
> will depend on how much lock contention happens.
>
> Thank you,
>
> >
> > [0] https://docs.kernel.org/locking/spinlocks.html
> >
> > Signed-off-by: Jonathan Haslam <jonathan.haslam@xxxxxxxxx>
> > ---
> > kernel/events/uprobes.c | 22 +++++++++++-----------
> > 1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 929e98c62965..42bf9b6e8bc0 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > */
> > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> >
> > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> >
> > #define UPROBES_HASH_SZ 13
> > /* serialize uprobe->pending_list */
> > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > {
> > struct uprobe *uprobe;
> >
> > - spin_lock(&uprobes_treelock);
> > + read_lock(&uprobes_treelock);
> > uprobe = __find_uprobe(inode, offset);
> > - spin_unlock(&uprobes_treelock);
> > + read_unlock(&uprobes_treelock);
> >
> > return uprobe;
> > }
> > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > {
> > struct uprobe *u;
> >
> > - spin_lock(&uprobes_treelock);
> > + write_lock(&uprobes_treelock);
> > u = __insert_uprobe(uprobe);
> > - spin_unlock(&uprobes_treelock);
> > + write_unlock(&uprobes_treelock);
> >
> > return u;
> > }
> > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > if (WARN_ON(!uprobe_is_active(uprobe)))
> > return;
> >
> > - spin_lock(&uprobes_treelock);
> > + write_lock(&uprobes_treelock);
> > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > - spin_unlock(&uprobes_treelock);
> > + write_unlock(&uprobes_treelock);
> > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > put_uprobe(uprobe);
> > }
> > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > min = vaddr_to_offset(vma, start);
> > max = min + (end - start) - 1;
> >
> > - spin_lock(&uprobes_treelock);
> > + read_lock(&uprobes_treelock);
> > n = find_node_in_range(inode, min, max);
> > if (n) {
> > for (t = n; t; t = rb_prev(t)) {
> > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > get_uprobe(u);
> > }
> > }
> > - spin_unlock(&uprobes_treelock);
> > + read_unlock(&uprobes_treelock);
> > }
> >
> > /* @vma contains reference counter, not the probed instruction. */
> > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > min = vaddr_to_offset(vma, start);
> > max = min + (end - start) - 1;
> >
> > - spin_lock(&uprobes_treelock);
> > + read_lock(&uprobes_treelock);
> > n = find_node_in_range(inode, min, max);
> > - spin_unlock(&uprobes_treelock);
> > + read_unlock(&uprobes_treelock);
> >
> > return !!n;
> > }
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>