Re: [RFC][PATCH] module: Optimize __module_address() using a latched RB-tree

From: Paul E. McKenney
Date: Thu Feb 26 2015 - 14:41:57 EST


On Thu, Feb 26, 2015 at 08:13:35PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 26, 2015 at 10:28:17AM -0800, Paul E. McKenney wrote:
>
>
> > Color me confused, both by the existing code and the modifications.
> >
> > It appears that you are using seqlock to force readers to retry when
> > a concurrent update occurs, but I don't see what is ensuring that the
> > readers see good data when racing with an insertion or a deletion.
>
> Yes, I suppose that needs a wee bit more work (and comments).
>
> > Yes,
> > the reader will be retried, courtesy of the seqlock, but that won't help
> > if the reader segfaults before it gets to the read_seqcount_retry().
>
> Right, I don't think that can happen, but see below.
>
> > > +static void __tree_insert(struct latch_tree_root *mod_tree, struct module *mod, int idx)
> > > +{
> > > +	struct rb_root *root = &mod_tree->tree[idx & latch_bit];
> > > +	struct module_node *mn = &mod->tree_node[idx];
> > > +	struct rb_node **link = &root->rb_node;
> > > +	struct rb_node *parent = NULL;
> > > +	unsigned long mod_val, m_val;
> > > +	struct module *m;
> > > +	int i;
> > > +
> > > +	mn->mod = mod;
> > > +	mod_val = mod_value(mod, idx);
> > > +
> > > +	while (*link) {
> > > +		parent = *link;
> > > +		m = mod_entry(parent);
> > > +		i = mod_node_idx(m, parent);
> > > +		m_val = mod_value(m, i);
> > > +
> > > +		if (mod_val < m_val)
> > > +			link = &parent->rb_left;
> > > +		else
> > > +			link = &parent->rb_right;
> > > +	}
> > > +
> > > +	rb_link_node(&mn->node, parent, link);
> >
> > This makes the new module visible to readers, if I understand correctly.
>
> You do.

Whew! ;-)

> > There needs to be a memory barrier between initialization and this call
> > to rb_link_node(); otherwise, both the CPU (well, for Alpha) and the
> > compiler (for everyone) can reorder, which could result in some hapless
> > reader seeing pre-initialization data.
> >
> > Or did I miss the memory barrier?
>
> I was going with the one in list_add_rcu() and the two in
> raw_write_seqcount_latch(); however, now that you made me look at it, I
> need one in here to ensure mn->mod is observed.
>
> So yes. The best solution might be a mod_tree_init() which initializes
> those back pointers before this insertion.

OK, sounds good.
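
Something like the following might do it (a rough sketch on my part,
using the mod_tree_init() name you suggested and assuming the
tree_node[] array layout from this patch):

static void mod_tree_init(struct module *mod)
{
	int i;

	/*
	 * Initialize the back-pointers for every latch copy before
	 * any node is linked into a tree, so that the barriers
	 * already on the publish path (list_add_rcu() and the pair
	 * in raw_write_seqcount_latch()) order these stores for us.
	 */
	for (i = 0; i < ARRAY_SIZE(mod->tree_node); i++)
		mod->tree_node[i].mod = mod;
}

Calling that from load_module() before the first __tree_insert() would
let the existing barriers do the rest.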

> > > +	rb_insert_color(&mn->node, root);
> >
> > This -might- be OK -- the rotations do not make new nodes visible,
> > instead simply rearranging nodes that are already visible.
> >
> > For both rb_link_node() and rb_insert_color(), given that we are updating
> > pointers while readers are traversing them, READ_ONCE() and WRITE_ONCE()
> > would be good. Yes, the compiler -probably- doesn't mess you up, but it
> > would be completely within its rights to do so. :-/
> >
> > Alpha would like either rcu_dereference() or lockless_dereference()
> > instead of READ_ONCE(), of course!
>
> I _think_ we should be OK here, since, as you say, we only reorder. If
> we observe nodes in the wrong order, our traversal gets confused and
> might return the wrong node or none at all.
>
> That is fine.

And seeing an old pointer for some time might cause the CPU to loop,
but the update should eventually make its way through the cache
hierarchy, which should allow the CPU to find its way out. So agreed.
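
For the record, the read side we are discussing looks roughly like the
following (my reconstruction, not verbatim from the patch; mod_tree and
mod_tree_seq stand in for the actual globals):

static struct module *mod_tree_find(unsigned long addr)
{
	struct module *m;
	unsigned int seq;

	do {
		seq = raw_read_seqcount(&mod_tree_seq);
		/* The low bit of seq selects the stable tree copy. */
		m = __tree_find(&mod_tree.tree[seq & latch_bit], addr);
	} while (read_seqcount_retry(&mod_tree_seq, seq));

	return m;
}

So even if a traversal of the unstable copy wanders, the retry loop
bounds the damage once the update becomes visible.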

> Of course, as you say, the compiler is within its rights to split stores
> and do other creative nonsense without the volatile thing. But I don't
> think it will actually do so for native word-sized thingies.
>
> If we're really paranoid, I suppose I can go make copies of the
> functions that have READ/WRITE_ONCE() in them.

Well, I am really paranoid, as you know. ;-)

> As to Alpha, I think we're good there: the split cache would make us see
> stale data, which is perfectly fine; we'll traverse a crap tree, no
> problem. That holds as long as native-sized and native-aligned
> loads/stores aren't split -- which I'm pretty sure doesn't happen
> (because of the split cache).

I didn't see anything that would cause a pointer to span two cache lines.
(Of course, that would break everywhere, not just on Alpha.)

> > > +}
> > > +
> > > +static void __tree_remove(struct latch_tree_root *mod_tree, struct module *mod, int idx)
> > > +{
> > > +	struct rb_root *root = &mod_tree->tree[idx & latch_bit];
> > > +	struct module_node *mn = &mod->tree_node[idx];
> > > +
> > > +	rb_erase(&mn->node, root);
> >
> > This is just removing and not freeing, so OK from a read-side viewpoint.
> > WRITE_ONCE() would be good.
>
> Yah, would need a copy of lib/rbtree.c again :/

Or just have the *_ONCE() calls in the single version. I bet that there
is no visible performance loss.
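
For instance, the publication store in rb_link_node() would become
something like this (sketched against include/linux/rbtree.h, untested):

static inline void rb_link_node(struct rb_node *node, struct rb_node *parent,
				struct rb_node **rb_link)
{
	node->__rb_parent_color = (unsigned long)parent;
	node->rb_left = node->rb_right = NULL;

	/* This store is what makes the node visible to lockless readers. */
	WRITE_ONCE(*rb_link, node);
}

The child-pointer updates in lib/rbtree.c's rotation and erase paths
would get the same treatment.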

> > > +static struct module *__tree_find(struct rb_root *r, unsigned long addr)
> > > +{
> > > +	struct rb_node *n = r->rb_node;
> > > +	unsigned long m_val, m_size;
> > > +
> > > +	while (n) {
> > > +		struct module *m = mod_entry(n);
> > > +		int idx = mod_node_idx(m, n);
> > > +
> > > +		m_val = mod_value(m, idx);
> > > +		m_size = mod_size(m, idx);
> > > +
> > > +		if (addr < m_val)
> > > +			n = n->rb_left;
> >
> > We need a rcu_dereference() or lockless_dereference() here, I think.
> >
> > > +		else if (addr >= m_val + m_size)
> > > +			n = n->rb_right;
> >
> > And here.
>
> As per the above argument, I think we're fine even without doing the
> whole READ/WRITE_ONCE treatment for the rbtree primitives. I don't
> think we actually need the read_barrier_depends().
>
> I think this because raw_read_seqcount() will issue an rmb(), which
> should be enough for the 'stable' tree. For the unstable one we don't
> care, as long as we don't see split loads/stores.

I agree that raw_read_seqcount() will issue an rmb(), but that won't help.
For Alpha, you need the smp_rmb() to be between the load of any given
pointer and the first dereference of that same pointer.
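
Concretely, I would like the child-pointer loads in __tree_find() to
look something like this (a sketch only):

		if (addr < m_val)
			n = lockless_dereference(n->rb_left);
		else if (addr >= m_val + m_size)
			n = lockless_dereference(n->rb_right);
		else
			return m;

That puts the smp_read_barrier_depends() between each pointer load and
its dereference on the next loop iteration, which is exactly where
Alpha needs it.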

> > > +		else
> > > +			return m;
> > > +	}
> > > +
> > > +	return NULL;
> >
> > I did finally find the RCU read-side critical section, supplied by
> > is_module_text_address()'s preempt_disable() and preempt_enable().
>
> Indeed.
>
> > The matching synchronize_sched() is supplied by do_init_module() and
> > load_module(). I expected a synchronize_sched() in free_module() as well,
> > but I only see a synchronize_rcu(). Is the required synchronize_sched()
> > on this code path hiding somewhere?
> >
> > Or did I miss an rcu_read_lock()/rcu_read_unlock() pair somewhere?
>
> No, I think someone is 'hoping' sync_rcu() implies sync_sched(). But
> yes, I should look harder at this. I was assuming the existing code was
> OK here.

I am not blaming you for the code itself, but rather just blaming you
for causing me to notice that the code was broken. ;-)

How about if I make something that allows overlapping grace periods,
so that we could be safe without a latency penalty? One approach would
be a synchronize_rcu_mult() with a bitmask indicating which grace
periods to wait for. Another would be to make variants of
get_state_synchronize_rcu() and cond_synchronize_rcu() for RCU-sched
as well as for RCU, but also to make get_state_synchronize_rcu() force
a grace period to start if one was not already in progress.
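
To illustrate the second option, the teardown path could overlap the
two grace periods something like this (hypothetical: the _sched
variants do not yet exist, and get_state_synchronize_rcu() would need
the force-a-grace-period change):

	unsigned long rcu_gp = get_state_synchronize_rcu();
	unsigned long sched_gp = get_state_synchronize_sched();

	/* Both grace periods elapse while other teardown work proceeds. */

	cond_synchronize_rcu(rcu_gp);		/* blocks only if needed */
	cond_synchronize_sched(sched_gp);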

Thoughts?

Thanx, Paul
