Re: [PATCH v4] srcu: Clarify comments on memory barrier "E"

From: Paul E. McKenney
Date: Tue Feb 07 2023 - 22:38:43 EST


On Sat, Jan 28, 2023 at 09:09:04PM -0800, Paul E. McKenney wrote:
> On Sat, Jan 28, 2023 at 04:16:34PM -0500, Joel Fernandes wrote:
> > On Sat, Jan 28, 2023 at 1:24 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > >
> > > On Sat, Jan 28, 2023 at 03:59:01AM +0000, Joel Fernandes (Google) wrote:
> > > > During a flip, we have a full memory barrier before srcu_idx is incremented.
> > > >
> > > > The idea is we intend to order the first phase scan's read of lock
> > > > counters with the flipping of the index.
> > > >
> > > > However, such ordering is already enforced because of the
> > > > control-dependency between the 2 scans. We would be flipping the index
> > > > only if lock and unlock counts matched.
> > > >
> > > > But such match will not happen if there was a pending reader before the flip
> > > > in the first place (observation courtesy Mathieu Desnoyers).
> > > >
> > > > The litmus test below shows this:
> > > > (test courtesy Frederic Weisbecker, Changes for ctrldep by Boqun/me):
> > >
> > > Much better, thank you!
> > >
> > > I of course did the usual wordsmithing, as shown below. Does this
> > > version capture your intent and understanding?
> > >
> >
> > It looks good to me.
> > According to [1] , the architecture at least should not be reordering
> > read-write control dependency. Only read-read is problematic. But I am
> > not 100% sure, is that not true?
>
> Agreed, READ_ONCE() or stronger through condition to WRITE_ONCE()
> or stronger is ordered. Replace that WRITE_ONCE() with any type of
> unordered read and all bets are off.
>
> And now that the ARM folks chimed in, this is a solid guarantee at
> the hardware level.
>
> Not so much at the compiler level. Oddly enough, compilers do provide
> ordering for plain C-language stores, but that is helpful only if no
> other CPU or thread is concurrently accessing that variable.
>
> > For the compiler, you are saying that read-write control dependency
> > can be reordered even with *ONCE() accesses? In other words, the
> > flipping of idx can happen in ->po order before the locks and unlock
> > counts match? That sounds sort of like a broken compiler.
>
> One case where a sane compiler can reasonably enable the hardware to
> do the reordering is where you have the same WRITE_ONCE() in both the
> then-clause and else-clause of an "if" statement. Another is if it can
> somehow prove something about the value returned from that READ_ONCE(),
> for example, if that value is used to index a singleton array, then the
> compiler has to do the READ_ONCE(), but it can then assume that the
> value returned was zero, throwing away the real value returned.
>
> Fun with undefined behavior!
>
> > [1] https://lpc.events/event/7/contributions/821/attachments/598/1075/LPC_2020_--_Dependency_ordering.pdf
> >
> > More comments below:

Except that it was pointed out to me that the Co-developed-by tags also
need Signed-off-by tags. If there are no objections to the update shown
below, I will fix this on my next rebase.

Thanx, Paul

------------------------------------------------------------------------

commit 6c135bb38c55d354527a6659cbf2f4e7e20b4360
Author: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
Date: Sat Jan 28 03:59:01 2023 +0000

srcu: Clarify comments on memory barrier "E"

There is an smp_mb() named "E" in srcu_flip() immediately before the
increment (flip) of the srcu_struct structure's ->srcu_idx.

The purpose of E is to order the preceding scan's read of lock counters
against the flipping of the ->srcu_idx, in order to prevent new readers
from continuing to use the old ->srcu_idx value, which might needlessly
extend the grace period.

However, this ordering is already enforced because of the control
dependency between the preceding scan and the ->srcu_idx flip.
This control dependency exists because atomic_long_read() is used
to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx,
and because ->srcu_idx is not flipped until the ->srcu_lock_count[] and
->srcu_unlock_count[] counts match. And such a match cannot happen when
there is an in-flight reader that started before the flip (observation
courtesy Mathieu Desnoyers).

The litmus test below (courtesy of Frederic Weisbecker, with changes
for ctrldep by Boqun and Joel) shows this:

C srcu
(*
* bad condition: P0's first scan (SCAN1) saw P1's idx=0 LOCK count inc, though P1 saw flip.
*
* So basically, the ->po ordering on both P0 and P1 is enforced via ->ppo
* (control deps) on both sides, and both P0 and P1 are interconnected by ->rf
* relations. Combining the ->ppo with ->rf, a cycle is impossible.
*)

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
int lock1;
int unlock1;
int lock0;
int unlock0;

// SCAN1
unlock1 = READ_ONCE(*UNLOCK1);
smp_mb(); // A
lock1 = READ_ONCE(*LOCK1);

// FLIP
if (lock1 == unlock1) { // Control dep
smp_mb(); // E // Remove E and still passes.
WRITE_ONCE(*IDX, 1);
smp_mb(); // D

// SCAN2
unlock0 = READ_ONCE(*UNLOCK0);
smp_mb(); // A
lock0 = READ_ONCE(*LOCK0);
}
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
int tmp;
int idx1;
int idx2;

// 1st reader
idx1 = READ_ONCE(*IDX);
if (idx1 == 0) { // Control dep
tmp = READ_ONCE(*LOCK0);
WRITE_ONCE(*LOCK0, tmp + 1);
smp_mb(); /* B and C */
tmp = READ_ONCE(*UNLOCK0);
WRITE_ONCE(*UNLOCK0, tmp + 1);
} else {
tmp = READ_ONCE(*LOCK1);
WRITE_ONCE(*LOCK1, tmp + 1);
smp_mb(); /* B and C */
tmp = READ_ONCE(*UNLOCK1);
WRITE_ONCE(*UNLOCK1, tmp + 1);
}
}

exists (0:lock1=1 /\ 1:idx1=1)

More complicated litmus tests with multiple SRCU readers also show that
memory barrier E is not needed.

This commit therefore clarifies the comment on memory barrier E.

Why not also remove that redundant smp_mb()?

Because control dependencies are quite fragile due to their not being
recognized by most compilers and tools. Control dependencies therefore
exact an ongoing maintenance burden, and such a burden cannot be justified
in this slowpath. Therefore, that smp_mb() stays until such time as
its overhead becomes a measurable problem in a real workload running on
a real production system, or until such time as compilers start paying
attention to this sort of control dependency.

Co-developed-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
Signed-off-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Co-developed-by: Boqun Feng <boqun.feng@xxxxxxxxx>
Signed-off-by: Boqun Feng <boqun.feng@xxxxxxxxx>
Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index c541b82646b63..cd46fe063e50f 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1085,16 +1085,36 @@ static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
static void srcu_flip(struct srcu_struct *ssp)
{
/*
- * Ensure that if this updater saw a given reader's increment
- * from __srcu_read_lock(), that reader was using an old value
- * of ->srcu_idx. Also ensure that if a given reader sees the
- * new value of ->srcu_idx, this updater's earlier scans cannot
- * have seen that reader's increments (which is OK, because this
- * grace period need not wait on that reader).
+ * Because the flip of ->srcu_idx is executed only if the
+ * preceding call to srcu_readers_active_idx_check() found that
+ * the ->srcu_unlock_count[] and ->srcu_lock_count[] sums matched
+ * and because that summing uses atomic_long_read(), there is
+ * ordering due to a control dependency between that summing and
+ * the WRITE_ONCE() in this call to srcu_flip(). This ordering
+ * ensures that if this updater saw a given reader's increment from
+ * __srcu_read_lock(), that reader was using a value of ->srcu_idx
+ * from before the previous call to srcu_flip(), which should be
+ * quite rare. This ordering thus helps forward progress because
+ * the grace period could otherwise be delayed by additional
+ * calls to __srcu_read_lock() using that old (soon to be new)
+ * value of ->srcu_idx.
+ *
+ * This sum-equality check and ordering also ensures that if
+ * a given call to __srcu_read_lock() uses the new value of
+ * ->srcu_idx, this updater's earlier scans cannot have seen
+ * that reader's increments, which is all to the good, because
+ * this grace period need not wait on that reader. After all,
+ * if those earlier scans had seen that reader, there would have
+ * been a sum mismatch and this code would not be reached.
+ *
+ * This means that the following smp_mb() is redundant, but
+ * it stays until either (1) Compilers learn about this sort of
+ * control dependency or (2) Some production workload running on
+ * a production system is unduly delayed by this slowpath smp_mb().
*/
smp_mb(); /* E */ /* Pairs with B and C. */

- WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1);
+ WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); // Flip the counter.

/*
* Ensure that if the updater misses an __srcu_read_unlock()