Re: task isolation discussion at Linux Plumbers

From: Paul E. McKenney
Date: Wed Nov 09 2016 - 12:38:31 EST


On Wed, Nov 09, 2016 at 03:14:35AM -0800, Andy Lutomirski wrote:
> On Tue, Nov 8, 2016 at 5:40 PM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:

Thank you for the review and comments!

> > commit 49961e272333ac720ac4ccbaba45521bfea259ae
> > Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> > Date: Tue Nov 8 14:25:21 2016 -0800
> >
> > rcu: Maintain special bits at bottom of ->dynticks counter
> >
> > Currently, IPIs are used to force other CPUs to invalidate their TLBs
> > in response to a kernel virtual-memory mapping change. This works, but
> > degrades both battery lifetime (for idle CPUs) and real-time response
> > (for nohz_full CPUs), and in addition results in unnecessary IPIs due to
> > the fact that CPUs executing in usermode are unaffected by stale kernel
> > mappings. It would be better to cause a CPU executing in usermode to
> > wait until it is entering kernel mode to
>
> missing words here?

Just a few, added more. ;-)

> > This commit therefore reserves a bit at the bottom of the ->dynticks
> > counter, which is checked upon exit from extended quiescent states. If it
> > is set, it is cleared and then a new rcu_dynticks_special_exit() macro
> > is invoked, which, if not supplied, is an empty single-pass do-while loop.
> > If this bottom bit is set on -entry- to an extended quiescent state,
> > then a WARN_ON_ONCE() triggers.
> >
> > This bottom bit may be set using a new rcu_dynticks_special_set()
> > function, which returns true if the bit was set, or false if the CPU
> > turned out to not be in an extended quiescent state. Please note that
> > this function refuses to set the bit for a non-nohz_full CPU when that
> > CPU is executing in usermode because usermode execution is tracked by
> > RCU as a dyntick-idle extended quiescent state only for nohz_full CPUs.
>
> I'm inclined to suggest s/dynticks/eqs/ in the public API. To me,
> "dynticks" is a feature, whereas "eqs" means "extended quiescent
> state" and means something concrete about the CPU state

OK, I have changed rcu_dynticks_special_exit() to rcu_eqs_special_exit().
I also changed rcu_dynticks_special_set() to rcu_eqs_special_set().

I left rcu_dynticks_snap() as is because it is internal to RCU. (External
only for the benefit of kernel/rcu/tree_trace.c.)

Any others?

Current state of patch below.

> > Reported-by: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
> > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> >
> > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> > index 4f9b2fa2173d..130d911e4ba1 100644
> > --- a/include/linux/rcutiny.h
> > +++ b/include/linux/rcutiny.h
> > @@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
> > return 0;
> > }
> >
> > +static inline bool rcu_dynticks_special_set(int cpu)
> > +{
> > + return false; /* Never flag non-existent other CPUs! */
> > +}
> > +
> > static inline unsigned long get_state_synchronize_rcu(void)
> > {
> > return 0;
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index dbf20b058f48..8de83830e86b 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
> > };
> >
> > /*
> > + * Steal a bit from the bottom of ->dynticks for idle entry/exit
> > + * control. Initially this is for TLB flushing.
> > + */
> > +#define RCU_DYNTICK_CTRL_MASK 0x1
> > +#define RCU_DYNTICK_CTRL_CTR (RCU_DYNTICK_CTRL_MASK + 1)
> > +#ifndef rcu_dynticks_special_exit
> > +#define rcu_dynticks_special_exit() do { } while (0)
> > +#endif
> > +
>
> > /*
> > @@ -305,17 +318,21 @@ static void rcu_dynticks_eqs_enter(void)
> > static void rcu_dynticks_eqs_exit(void)
> > {
> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > + int seq;
> >
> > /*
> > - * CPUs seeing atomic_inc() must see prior idle sojourns,
> > + * CPUs seeing atomic_inc_return() must see prior idle sojourns,
> > * and we also must force ordering with the next RCU read-side
> > * critical section.
> > */
> > - smp_mb__before_atomic(); /* See above. */
> > - atomic_inc(&rdtp->dynticks);
> > - smp_mb__after_atomic(); /* See above. */
> > + seq = atomic_inc_return(&rdtp->dynticks);
> > WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> > - !(atomic_read(&rdtp->dynticks) & 0x1));
> > + !(seq & RCU_DYNTICK_CTRL_CTR));
> > + if (seq & RCU_DYNTICK_CTRL_MASK) {
> > + atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
> > + smp_mb__after_atomic(); /* Clear bits before acting on them */
> > + rcu_dynticks_special_exit();
>
> I think this needs to be reversed for NMI safety: do the callback and
> then clear the bits.

OK. Ah, the race that I was worried about can't happen due to the
fact that rdtp->dynticks gets incremented before the call to
rcu_dynticks_special_exit().

Good catch, fixed.

And the other thing I forgot is that I cannot clear the bottom bits if
this is an NMI handler. But now I cannot construct a case where this
is a problem. The only way this could matter is if an NMI is taken in
an extended quiescent state. In that case, the code flushes and clears
the bit, and any later remote-flush request to this CPU will set the
bit again. And any races between the NMI handler and the other CPU look
the same as between IRQ handlers and process entry.

Yes, this one needs some formal verification, doesn't it?

In the meantime, if you can reproduce the race that led us to believe
that NMI handlers should not clear the bottom bits, please let me know.

> > +/*
> > + * Set the special (bottom) bit of the specified CPU so that it
> > + * will take special action (such as flushing its TLB) on the
> > + * next exit from an extended quiescent state. Returns true if
> > + * the bit was successfully set, or false if the CPU was not in
> > + * an extended quiescent state.
> > + */
> > +bool rcu_dynticks_special_set(int cpu)
> > +{
> > + int old;
> > + int new;
> > + struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> > +
> > + do {
> > + old = atomic_read(&rdtp->dynticks);
> > + if (old & RCU_DYNTICK_CTRL_CTR)
> > + return false;
> > + new = old | ~RCU_DYNTICK_CTRL_MASK;
>
> Shouldn't this be old | RCU_DYNTICK_CTRL_MASK?

Indeed it should! (What -was- I thinking?) Fixed.

> > + } while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
> > + return true;
> > }

Thank you again, please see update below.

Thanx, Paul

------------------------------------------------------------------------

commit 7bbb80d5f612e7f0ffc826b11d292a3616150b34
Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
Date: Tue Nov 8 14:25:21 2016 -0800

rcu: Maintain special bits at bottom of ->dynticks counter

Currently, IPIs are used to force other CPUs to invalidate their TLBs
in response to a kernel virtual-memory mapping change. This works, but
degrades both battery lifetime (for idle CPUs) and real-time response
(for nohz_full CPUs), and in addition results in unnecessary IPIs due to
the fact that CPUs executing in usermode are unaffected by stale kernel
mappings. It would be better to cause a CPU executing in usermode to
wait until it is entering kernel mode to do the flush, first to avoid
interrupting usemode tasks and second to handle multiple flush requests
with a single flush in the case of a long-running user task.

This commit therefore reserves a bit at the bottom of the ->dynticks
counter, which is checked upon exit from extended quiescent states.
If it is set, it is cleared and then a new rcu_eqs_special_exit() macro is
invoked, which, if not supplied, is an empty single-pass do-while loop.
If this bottom bit is set on -entry- to an extended quiescent state,
then a WARN_ON_ONCE() triggers.

This bottom bit may be set using a new rcu_eqs_special_set() function,
which returns true if the bit was set, or false if the CPU turned
out to not be in an extended quiescent state. Please note that this
function refuses to set the bit for a non-nohz_full CPU when that CPU
is executing in usermode because usermode execution is tracked by RCU
as a dyntick-idle extended quiescent state only for nohz_full CPUs.

Reported-by: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 4f9b2fa2173d..7232d199a81c 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
return 0;
}

+static inline bool rcu_eqs_special_set(int cpu)
+{
+ return false; /* Never flag non-existent other CPUs! */
+}
+
static inline unsigned long get_state_synchronize_rcu(void)
{
return 0;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index dbf20b058f48..342c8ee402d6 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
};

/*
+ * Steal a bit from the bottom of ->dynticks for idle entry/exit
+ * control. Initially this is for TLB flushing.
+ */
+#define RCU_DYNTICK_CTRL_MASK 0x1
+#define RCU_DYNTICK_CTRL_CTR (RCU_DYNTICK_CTRL_MASK + 1)
+#ifndef rcu_eqs_special_exit
+#define rcu_eqs_special_exit() do { } while (0)
+#endif
+
+/*
* Record entry into an extended quiescent state. This is only to be
* called when not already in an extended quiescent state.
*/
static void rcu_dynticks_eqs_enter(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ int seq;

/*
- * CPUs seeing atomic_inc() must see prior RCU read-side critical
- * sections, and we also must force ordering with the next idle
- * sojourn.
+ * CPUs seeing atomic_inc_return() must see prior RCU read-side
+ * critical sections, and we also must force ordering with the
+ * next idle sojourn.
*/
- smp_mb__before_atomic(); /* See above. */
- atomic_inc(&rdtp->dynticks);
- smp_mb__after_atomic(); /* See above. */
+ seq = atomic_inc_return(&rdtp->dynticks);
+ /* Better be in an extended quiescent state! */
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+ (seq & RCU_DYNTICK_CTRL_CTR));
+ /* Better not have special action (TLB flush) pending! */
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
- atomic_read(&rdtp->dynticks) & 0x1);
+ (seq & RCU_DYNTICK_CTRL_MASK));
}

/*
@@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
static void rcu_dynticks_eqs_exit(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ int seq;

/*
- * CPUs seeing atomic_inc() must see prior idle sojourns,
+ * CPUs seeing atomic_inc_return() must see prior idle sojourns,
* and we also must force ordering with the next RCU read-side
* critical section.
*/
- smp_mb__before_atomic(); /* See above. */
- atomic_inc(&rdtp->dynticks);
- smp_mb__after_atomic(); /* See above. */
+ seq = atomic_inc_return(&rdtp->dynticks);
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
- !(atomic_read(&rdtp->dynticks) & 0x1));
+ !(seq & RCU_DYNTICK_CTRL_CTR));
+ if (seq & RCU_DYNTICK_CTRL_MASK) {
+ rcu_eqs_special_exit();
+ /* Prefer duplicate flushes to losing a flush. */
+ smp_mb__before_atomic(); /* NMI safety. */
+ atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
+ }
}

/*
@@ -326,7 +344,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
{
int snap = atomic_add_return(0, &rdtp->dynticks);

- return snap;
+ return snap & ~RCU_DYNTICK_CTRL_MASK;
}

/*
@@ -335,7 +353,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
*/
static bool rcu_dynticks_in_eqs(int snap)
{
- return !(snap & 0x1);
+ return !(snap & RCU_DYNTICK_CTRL_CTR);
}

/*
@@ -355,10 +373,33 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
static void rcu_dynticks_momentary_idle(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
- int special = atomic_add_return(2, &rdtp->dynticks);
+ int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
+ &rdtp->dynticks);

/* It is illegal to call this from idle state. */
- WARN_ON_ONCE(!(special & 0x1));
+ WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
+}
+
+/*
+ * Set the special (bottom) bit of the specified CPU so that it
+ * will take special action (such as flushing its TLB) on the
+ * next exit from an extended quiescent state. Returns true if
+ * the bit was successfully set, or false if the CPU was not in
+ * an extended quiescent state.
+ */
+bool rcu_eqs_special_set(int cpu)
+{
+ int old;
+ int new;
+ struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
+
+ do {
+ old = atomic_read(&rdtp->dynticks);
+ if (old & RCU_DYNTICK_CTRL_CTR)
+ return false;
+ new = old | RCU_DYNTICK_CTRL_MASK;
+ } while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
+ return true;
}

DEFINE_PER_CPU_SHARED_ALIGNED(unsigned long, rcu_qs_ctr);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3b953dcf6afc..7dcdd59d894c 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -596,6 +596,7 @@ extern struct rcu_state rcu_preempt_state;
#endif /* #ifdef CONFIG_PREEMPT_RCU */

int rcu_dynticks_snap(struct rcu_dynticks *rdtp);
+bool rcu_eqs_special_set(int cpu);

#ifdef CONFIG_RCU_BOOST
DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_status);