Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

From: Paul E. McKenney
Date: Mon Oct 09 2017 - 19:15:28 EST


On Mon, Oct 09, 2017 at 10:16:37AM +0200, Peter Zijlstra wrote:
> On Sat, Oct 07, 2017 at 11:28:57AM -0700, Paul E. McKenney wrote:
> > But if you are saying that it would be good to have wait_for_completion()
> > and complete() directly modeled at some point, no argument. In addition,
> > I hope that the memory model is applied to other tools that analyze kernel
> > code.
>
> > > I'm not sure I got the point across; so I'll try once more. Without
> > > providing this ordering the completion would be fundamentally broken. It
> > > _must_ provide this ordering.
> >
> > OK, I now understand what you are getting at, and I do very much like
> > that guarantee.
>
> Right, so maybe we should update the completion comments a bit to call
> out this property, because I'm not sure its there.

That would be very good!

> Also, with this, I think the smp_store_release() / smp_load_acquire() is
> a perfectly fine abstraction of it, I don't think the model needs to be
> taught about the completion interface.

Agreed, if we pull too much stuff into the model it might well become
difficult to maintain in and of itself. So we do need to choose
carefully based on model performance and ease of use.

For a performance example, note that locking can be easily modeled with
xchg_acquire() for lock acquisition and smp_store_release() for lock
release. But here is the resulting performance compared to directly
modeling locking within the cat code:

xchg_acquire() and
# Processes smp_store_release() Direct modeling

2 0.061 s 0.007 s

3 3.542 s 0.039 s

4 569.189 s 0.397 s

5 > 113,853.000 s 5.187 s

Combinatorial explosion is a real pain... The problem is that the
xchg_acquire() / smp_store_release() approach requires the herd7 tool
to consider and then reject all the scenarios where the lock critical
sections overlap. In contrast, when direct modeling, those overlapping
critical-section scenarios are implicitly rejected by the cat-code that
models locking. Don't get me wrong, there is still combinatorial
explosion, as you can see from the rightmost column of numbers.
After all, the herd7 tool still needs to consider all possible
lock-acquisition orders. But the explosion is much less explosive. ;-)

Given that we expect locking to be very heavily used, it makes a lot
of sense to directly model it, which we (mostly Alan, Andrea, and Luc)
have done. Direct modeling of RCU gets similar speedups over emulation,
plus the fact that accurate emulation of RCU requires some surprisingly
unnatural acts. ("Ghost" accesses unaffected by normal memory-ordering
primitives, and "ghost-buster memory barriers" that affect both "ghost"
and normal accesses. Plus the transformation is complex, so much so
that I broke down and wrote a script, which was used to validate the
early versions of the RCU model.)

In constrast, I have no reason to believe that direct modeling could make
wait_for_completion() / complete() go any faster than the equivalent
setup using smp_store_release() / smp_load_acquire(), that is, I don't
see anything that could be optimized away byt cat-code. So the only
reason to directly model wait_for_completion() / complete() would be
to get rid of the need to add the extra term to the "exists" clause.
And agreed, we should wait until people are actually being significantly
inconvenienced by this, which I do not expect any time soon.

> > > Why not? In what sort of cases does it go wobbly?
> >
> > For one, when it conflicts with maintainability. For example, it would
> > probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
> > smp_mb__after_unlock_lock() invocations. But those are slowpaths, and the
> > small speedup on only one architecture is just not worth the added pain.
> > Especially given the nice wrapper functions that you provided.
> >
> > But of course if this were instead (say) rcu_read_lock() or common-case
> > rcu_read_unlock(), I would be willing to undergo much more pain. On the
> > other hand, for that exact reason, that common-case code path doesn't
> > acquire locks in the first place. ;-)
>
> Ah, so for models I would go absolutely minimal; it helps understand
> what the strict requirements are and where we over-provide etc..

Agreed, we should be careful how much we promise lest we get locked
into needlessly heavyweight implementations.

> For actual code you're entirely right, there's no point in trying to be
> cute with the rcu-node locks. Simplicity rules etc..

Hear, hear! ;-)

Thanx, Paul