Re: [RFC][PATCH 5/9] rbtree: Make lockless searches non-fatal

From: Mathieu Desnoyers
Date: Sun Mar 01 2015 - 08:53:29 EST


----- Original Message -----
> From: "Peter Zijlstra" <peterz@xxxxxxxxxxxxx>
> To: mingo@xxxxxxxxxx, rusty@xxxxxxxxxxxxxxx, "mathieu desnoyers" <mathieu.desnoyers@xxxxxxxxxxxx>, oleg@xxxxxxxxxx,
> paulmck@xxxxxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx, andi@xxxxxxxxxxxxxx, rostedt@xxxxxxxxxxx, tglx@xxxxxxxxxxxxx, peterz@xxxxxxxxxxxxx,
> "Michel Lespinasse" <walken@xxxxxxxxxx>, "Andrea Arcangeli" <aarcange@xxxxxxxxxx>, "David Woodhouse"
> <David.Woodhouse@xxxxxxxxx>, "Rik van Riel" <riel@xxxxxxxxxx>
> Sent: Saturday, February 28, 2015 4:24:52 PM
> Subject: [RFC][PATCH 5/9] rbtree: Make lockless searches non-fatal
>
> Change the insert and erase code such that lockless searches are
> non-fatal.
>
> In and of itself an rbtree cannot be correctly searched while
> in-modification, we can however provide weaker guarantees that will
> allow the rbtree to be used in conjunction with other techniques, such
> as latches; see 9b0fd802e8c0 ("seqcount: Add raw_write_seqcount_latch()").
>
> For this to work we need the following guarantees from the rbtree
> code:
>
> 1) a lockless reader must not see partial stores, this would allow it
> to observe nodes that are invalid memory.
>
> 2) there must not be (temporary) loops in the tree structure in the
> modifier's program order, this would cause a lookup which
> interrupts the modifier to get stuck indefinitely.

For (2), I don't think this is how the situation should be described.

Let's consider a scenario where an interrupt nests over the modification.
First, the modification will switch the latch to the other version of the
tree. Therefore, the interrupt will see a fully consistent tree, not the
tree being modified. Therefore, a temporary loop in the tree should not
be an issue for that peculiar situation.

However, if we have another thread traversing the tree while we
concurrently switch the latch and modify one version of the tree,
creating a temporary loop in the tree, this thread could possibly:

A) deadlock with the modification if there is a locking dependency
between tree modification, tree read, and another lock (transitive
dependency).
B) have the modifier starved by the other thread, if that thread has
a higher scheduling priority (e.g. RT) than the modifier. The high
priority thread would then use all its CPU time to perform the
temporary loop.

So I agree that loops are unwanted there: it allows us to never have
to care about situations A and B. However, the explanation about why
should not involve, AFAIU, an interrupt handler nesting over the tree
modification, because this is precisely one scenario that should not
care about loops.

Thoughs ?

Thanks!

Mathieu


>
> For 1) we must use WRITE_ONCE() for all updates to the tree structure;
> in particular this patch only does rb_{left,right} as those are the
> only element required for simple searches.
>
> It generates slightly worse code, probably because gcc stinks at
> volatile. But in pointer chasing heavy code a few instructions more
> should not matter.
>
> For 2) I have carefully audited the code and drawn every intermediate
> link state and not found a loop.
>
> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> Cc: Michel Lespinasse <walken@xxxxxxxxxx>
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: David Woodhouse <David.Woodhouse@xxxxxxxxx>
> Cc: Rik van Riel <riel@xxxxxxxxxx>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> ---
> include/linux/rbtree.h | 10 +++++
> include/linux/rbtree_augmented.h | 21 +++++++----
> lib/rbtree.c | 71
> ++++++++++++++++++++++++++-------------
> 3 files changed, 73 insertions(+), 29 deletions(-)
>
> --- a/include/linux/rbtree.h
> +++ b/include/linux/rbtree.h
> @@ -31,6 +31,7 @@
>
> #include <linux/kernel.h>
> #include <linux/stddef.h>
> +#include <linux/rcupdate.h>
>
> struct rb_node {
> unsigned long __rb_parent_color;
> @@ -85,6 +86,15 @@ static inline void rb_link_node(struct r
> *rb_link = node;
> }
>
> +static inline void rb_link_node_rcu(struct rb_node * node, struct rb_node *
> parent,
> + struct rb_node ** rb_link)
> +{
> + node->__rb_parent_color = (unsigned long)parent;
> + node->rb_left = node->rb_right = NULL;
> +
> + rcu_assign_pointer(*rb_link, node);
> +}
> +
> #define rb_entry_safe(ptr, type, member) \
> ({ typeof(ptr) ____ptr = (ptr); \
> ____ptr ? rb_entry(____ptr, type, member) : NULL; \
> --- a/include/linux/rbtree_augmented.h
> +++ b/include/linux/rbtree_augmented.h
> @@ -123,11 +123,11 @@ __rb_change_child(struct rb_node *old, s
> {
> if (parent) {
> if (parent->rb_left == old)
> - parent->rb_left = new;
> + WRITE_ONCE(parent->rb_left, new);
> else
> - parent->rb_right = new;
> + WRITE_ONCE(parent->rb_right, new);
> } else
> - root->rb_node = new;
> + WRITE_ONCE(root->rb_node, new);
> }
>
> extern void __rb_erase_color(struct rb_node *parent, struct rb_root *root,
> @@ -137,7 +137,8 @@ static __always_inline struct rb_node *
> __rb_erase_augmented(struct rb_node *node, struct rb_root *root,
> const struct rb_augment_callbacks *augment)
> {
> - struct rb_node *child = node->rb_right, *tmp = node->rb_left;
> + struct rb_node *child = node->rb_right;
> + struct rb_node *tmp = node->rb_left;
> struct rb_node *parent, *rebalance;
> unsigned long pc;
>
> @@ -167,6 +168,7 @@ __rb_erase_augmented(struct rb_node *nod
> tmp = parent;
> } else {
> struct rb_node *successor = child, *child2;
> +
> tmp = child->rb_left;
> if (!tmp) {
> /*
> @@ -180,6 +182,7 @@ __rb_erase_augmented(struct rb_node *nod
> */
> parent = successor;
> child2 = successor->rb_right;
> +
> augment->copy(node, successor);
> } else {
> /*
> @@ -201,19 +204,23 @@ __rb_erase_augmented(struct rb_node *nod
> successor = tmp;
> tmp = tmp->rb_left;
> } while (tmp);
> - parent->rb_left = child2 = successor->rb_right;
> - successor->rb_right = child;
> + child2 = successor->rb_right;
> + WRITE_ONCE(parent->rb_left, child2);
> + WRITE_ONCE(successor->rb_right, child);
> rb_set_parent(child, successor);
> +
> augment->copy(node, successor);
> augment->propagate(parent, successor);
> }
>
> - successor->rb_left = tmp = node->rb_left;
> + tmp = node->rb_left;
> + WRITE_ONCE(successor->rb_left, tmp);
> rb_set_parent(tmp, successor);
>
> pc = node->__rb_parent_color;
> tmp = __rb_parent(pc);
> __rb_change_child(node, successor, tmp, root);
> +
> if (child2) {
> successor->__rb_parent_color = pc;
> rb_set_parent_color(child2, parent, RB_BLACK);
> --- a/lib/rbtree.c
> +++ b/lib/rbtree.c
> @@ -44,6 +44,25 @@
> * parentheses and have some accompanying text comment.
> */
>
> +/*
> + * All stores to the tree structure (rb_left and rb_right) must be done
> using
> + * WRITE_ONCE(). And we must not inadvertently cause (temporary) loops in
> the
> + * tree structure as seen in program order.
> + *
> + * These two requirements will allow lockless iteration of the tree -- not
> + * correct iteration mind you, tree rotations are not atomic so a lookup
> might
> + * miss entire subtrees.
> + *
> + * But they do guarantee that any such traversal will only see valid
> elements
> + * and that it will indeed complete -- does not get stuck in a loop.
> + *
> + * NOTE:
> + *
> + * Stores to __rb_parent_color are not important for simple lookups so those
> + * are left undone as of now. Nor did I check for loops involving parent
> + * pointers.
> + */
> +
> static inline void rb_set_black(struct rb_node *rb)
> {
> rb->__rb_parent_color |= RB_BLACK;
> @@ -129,8 +148,9 @@ __rb_insert(struct rb_node *node, struct
> * This still leaves us in violation of 4), the
> * continuation into Case 3 will fix that.
> */
> - parent->rb_right = tmp = node->rb_left;
> - node->rb_left = parent;
> + tmp = node->rb_left;
> + WRITE_ONCE(parent->rb_right, tmp);
> + WRITE_ONCE(node->rb_left, parent);
> if (tmp)
> rb_set_parent_color(tmp, parent,
> RB_BLACK);
> @@ -149,8 +169,8 @@ __rb_insert(struct rb_node *node, struct
> * / \
> * n U
> */
> - gparent->rb_left = tmp; /* == parent->rb_right */
> - parent->rb_right = gparent;
> + WRITE_ONCE(gparent->rb_left, tmp); /* == parent->rb_right */
> + WRITE_ONCE(parent->rb_right, gparent);
> if (tmp)
> rb_set_parent_color(tmp, gparent, RB_BLACK);
> __rb_rotate_set_parents(gparent, parent, root, RB_RED);
> @@ -171,8 +191,9 @@ __rb_insert(struct rb_node *node, struct
> tmp = parent->rb_left;
> if (node == tmp) {
> /* Case 2 - right rotate at parent */
> - parent->rb_left = tmp = node->rb_right;
> - node->rb_right = parent;
> + tmp = node->rb_right;
> + WRITE_ONCE(parent->rb_left, tmp);
> + WRITE_ONCE(node->rb_right, parent);
> if (tmp)
> rb_set_parent_color(tmp, parent,
> RB_BLACK);
> @@ -183,8 +204,8 @@ __rb_insert(struct rb_node *node, struct
> }
>
> /* Case 3 - left rotate at gparent */
> - gparent->rb_right = tmp; /* == parent->rb_left */
> - parent->rb_left = gparent;
> + WRITE_ONCE(gparent->rb_right, tmp); /* == parent->rb_left */
> + WRITE_ONCE(parent->rb_left, gparent);
> if (tmp)
> rb_set_parent_color(tmp, gparent, RB_BLACK);
> __rb_rotate_set_parents(gparent, parent, root, RB_RED);
> @@ -224,8 +245,9 @@ ____rb_erase_color(struct rb_node *paren
> * / \ / \
> * Sl Sr N Sl
> */
> - parent->rb_right = tmp1 = sibling->rb_left;
> - sibling->rb_left = parent;
> + tmp1 = sibling->rb_left;
> + WRITE_ONCE(parent->rb_right, tmp1);
> + WRITE_ONCE(sibling->rb_left, parent);
> rb_set_parent_color(tmp1, parent, RB_BLACK);
> __rb_rotate_set_parents(parent, sibling, root,
> RB_RED);
> @@ -275,9 +297,10 @@ ____rb_erase_color(struct rb_node *paren
> * \
> * Sr
> */
> - sibling->rb_left = tmp1 = tmp2->rb_right;
> - tmp2->rb_right = sibling;
> - parent->rb_right = tmp2;
> + tmp1 = tmp2->rb_right;
> + WRITE_ONCE(sibling->rb_left, tmp1);
> + WRITE_ONCE(tmp2->rb_right, sibling);
> + WRITE_ONCE(parent->rb_right, tmp2);
> if (tmp1)
> rb_set_parent_color(tmp1, sibling,
> RB_BLACK);
> @@ -297,8 +320,9 @@ ____rb_erase_color(struct rb_node *paren
> * / \ / \
> * (sl) sr N (sl)
> */
> - parent->rb_right = tmp2 = sibling->rb_left;
> - sibling->rb_left = parent;
> + tmp2 = sibling->rb_left;
> + WRITE_ONCE(parent->rb_right, tmp2);
> + WRITE_ONCE(sibling->rb_left, parent);
> rb_set_parent_color(tmp1, sibling, RB_BLACK);
> if (tmp2)
> rb_set_parent(tmp2, parent);
> @@ -310,8 +334,9 @@ ____rb_erase_color(struct rb_node *paren
> sibling = parent->rb_left;
> if (rb_is_red(sibling)) {
> /* Case 1 - right rotate at parent */
> - parent->rb_left = tmp1 = sibling->rb_right;
> - sibling->rb_right = parent;
> + tmp1 = sibling->rb_right;
> + WRITE_ONCE(parent->rb_left, tmp1);
> + WRITE_ONCE(sibling->rb_right, parent);
> rb_set_parent_color(tmp1, parent, RB_BLACK);
> __rb_rotate_set_parents(parent, sibling, root,
> RB_RED);
> @@ -336,9 +361,10 @@ ____rb_erase_color(struct rb_node *paren
> break;
> }
> /* Case 3 - right rotate at sibling */
> - sibling->rb_right = tmp1 = tmp2->rb_left;
> - tmp2->rb_left = sibling;
> - parent->rb_left = tmp2;
> + tmp1 = tmp2->rb_left;
> + WRITE_ONCE(sibling->rb_right, tmp1);
> + WRITE_ONCE(tmp2->rb_left, sibling);
> + WRITE_ONCE(parent->rb_left, tmp2);
> if (tmp1)
> rb_set_parent_color(tmp1, sibling,
> RB_BLACK);
> @@ -347,8 +373,9 @@ ____rb_erase_color(struct rb_node *paren
> sibling = tmp2;
> }
> /* Case 4 - left rotate at parent + color flips */
> - parent->rb_left = tmp2 = sibling->rb_right;
> - sibling->rb_right = parent;
> + tmp2 = sibling->rb_right;
> + WRITE_ONCE(parent->rb_left, tmp2);
> + WRITE_ONCE(sibling->rb_right, parent);
> rb_set_parent_color(tmp1, sibling, RB_BLACK);
> if (tmp2)
> rb_set_parent(tmp2, parent);
>
>
>

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/