Re: [RFC 0/2] srcu: Remove pre-flip memory barrier

From: Paul E. McKenney
Date: Thu Dec 22 2022 - 13:54:30 EST


On Thu, Dec 22, 2022 at 01:19:06PM -0500, Joel Fernandes wrote:
>
>
> > On Dec 22, 2022, at 11:43 AM, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Thu, Dec 22, 2022 at 01:40:10PM +0100, Frederic Weisbecker wrote:
> >>> On Wed, Dec 21, 2022 at 12:11:42PM -0500, Mathieu Desnoyers wrote:
> >>> On 2022-12-21 06:59, Frederic Weisbecker wrote:
> >>>>> On Tue, Dec 20, 2022 at 10:34:19PM -0500, Mathieu Desnoyers wrote:
> >>> [...]
> >>>>>
> >>>>> The memory ordering constraint I am concerned about here is:
> >>>>>
> >>>>> * [...] In addition,
> >>>>> * each CPU having an SRCU read-side critical section that extends beyond
> >>>>> * the return from synchronize_srcu() is guaranteed to have executed a
> >>>>> * full memory barrier after the beginning of synchronize_srcu() and before
> >>>>> * the beginning of that SRCU read-side critical section. [...]
> >>>>>
> >>>>> So if we have a SRCU read-side critical section that begins after the beginning
> >>>>> of synchronize_srcu, but before its first memory barrier, it would miss the
> >>>>> guarantee that the full memory barrier is issued before the beginning of that
> >>>>> SRCU read-side critical section. IOW, that memory barrier needs to be at the
> >>>>> very beginning of the grace period.
> >>>>
> >>>> I'm confused, what's wrong with this ?
> >>>>
> >>>> UPDATER READER
> >>>> ------- ------
> >>>> STORE X = 1 STORE srcu_read_lock++
> >>>> // rcu_seq_snap() smp_mb()
> >>>> smp_mb() READ X
> >>>> // scans
> >>>> READ srcu_read_lock
> >>>
> >>> What you refer to here is only memory ordering of the store to X and load
> >>> from X wrt loading/increment of srcu_read_lock, which is internal to the
> >>> srcu implementation. If we really want to model the provided high-level
> >>> memory ordering guarantees, we should consider a scenario where SRCU is used
> >>> for its memory ordering properties to synchronize other variables.
> >>>
> >>> I'm concerned about the following Dekker scenario, where synchronize_srcu()
> >>> and srcu_read_lock/unlock would be used instead of memory barriers:
> >>>
> >>> Initial state: X = 0, Y = 0
> >>>
> >>> Thread A Thread B
> >>> ---------------------------------------------
> >>> STORE X = 1 STORE Y = 1
> >>> synchronize_srcu()
> >>> srcu_read_lock()
> >>> r1 = LOAD X
> >>> srcu_read_unlock()
> >>> r0 = LOAD Y
> >>>
> >>> BUG_ON(!r0 && !r1)
> >>>
> >>> So in the synchronize_srcu implementation, there appears to be two
> >>> major scenarios: either srcu_gp_start_if_needed starts a gp or expedited gp,
> >>> or it uses an already started gp/expedited gp. When snapshotting with
> >>> rcu_seq_snap, the fact that the memory barrier is after the ssp->srcu_gp_seq
> >>> load means that it does not order prior memory accesses before that load.
> >>> This sequence value is then used to identify which gp_seq to wait for when
> >>> piggy-backing on another already-started gp. I worry about reordering
> >>> between STORE X = 1 and load of ssp->srcu_gp_seq, which is then used to
> >>> piggy-back on an already-started gp.
> >>>
> >>> I suspect that the implicit barrier in srcu_read_lock() invoked at the
> >>> beginning of srcu_gp_start_if_needed() is really the barrier that makes
> >>> all this behave as expected. But without documentation it's rather hard to
> >>> follow.
> >>
> >> Oh ok I see now. It might be working that way by accident or on forgotten
> >> purpose. In any case, we really want to add a comment above that
> >> __srcu_read_lock_nmisafe() call.
> >
> > Another test for the safety (or not) of removing either D or E is
> > to move that WRITE_ONCE() to follow (or, respectively, precede) the
> > adjacent scans.
>
> Good idea, though I believe the MBs that the above talk about are not the flip ones. They are the ones in synchronize_srcu() beginning and end, that order with respect to grace period start and end.
>
> So that (flipping MBs) is unrelated, or did I miss something?

The thought is to manually similate in the source code the maximum
memory-reference reordering that a maximally hostile compiler and CPU
would be permitted to carry out. So yes, given that there are other
memory barriers before and after, these other memory barriers limit how
far the flip may be moved in the source code.

Here I am talking about the memory barriers associated with the flip,
but the same trick can of course be applied to other memory barriers.
In general, remove a given memory barrier and (in the source code)
maximally rearrange the memory references that were previously ordered
by the memory barrier in question.

Again, the presence of other memory barriers will limit the permitted
maximal source-code rearrangement.

Thanx, Paul