Re: Updated sys_membarrier() speedup patch, FYI

From: Paul E. McKenney
Date: Thu Jul 27 2017 - 16:37:15 EST


On Thu, Jul 27, 2017 at 11:04:13PM +0300, Avi Kivity wrote:
> On 07/27/2017 10:43 PM, Paul E. McKenney wrote:
> >On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
> >>On 07/27/2017 09:12 PM, Paul E. McKenney wrote:
> >>>Hello!
> >>>
> >>>Please see below for a prototype sys_membarrier() speedup patch.
> >>>Please note that there is some controversy on this subject, so the final
> >>>version will probably be quite a bit different than this prototype.
> >>>
> >>>But my main question is whether the throttling shown below is acceptable
> >>>for your use cases, namely only one expedited sys_membarrier() permitted
> >>>per scheduling-clock period (1 millisecond on many platforms), with any
> >>>excess being silently converted to non-expedited form. The reason for
> >>>the throttling is concerns about DoS attacks based on user code with a
> >>>tight loop invoking this system call.
> >>>
> >>>Thoughts?
> >>Silent throttling would render it useless for me. -EAGAIN is a
> >>little better, but I'd be forced to spin until either I get kicked
> >>out of my loop, or it succeeds.
> >>
> >>IPIing only running threads of my process would be perfect. In fact
> >>I might even be able to make use of "membarrier these threads
> >>please" to reduce IPIs, when I change the topology from fully
> >>connected to something more sparse, on larger machines.
> >>
> >>My previous implementations were a signal (but that's horrible on
> >>large machines) and trylock + mprotect (but that doesn't work on
> >>ARM).
> >OK, how about the following patch, which IPIs only the running
> >threads of the process doing the sys_membarrier()?
>
> Works for me.

Thank you for testing! I expect that Mathieu will have a v2 soon,
hopefully CCing you guys. (If not, I will forward it.)

Mathieu, please note Avi's feedback below.

Thanx, Paul

> >------------------------------------------------------------------------
> >
> >From: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> >To: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >Cc: linux-kernel@xxxxxxxxxxxxxxx, Mathieu Desnoyers
> > <mathieu.desnoyers@xxxxxxxxxxxx>,
> > "Paul E . McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>, Boqun Feng <boqun.feng@xxxxxxxxx>
> >Subject: [RFC PATCH] membarrier: expedited private command
> >Date: Thu, 27 Jul 2017 14:59:43 -0400
> >Message-Id: <20170727185943.11570-1-mathieu.desnoyers@xxxxxxxxxxxx>
> >
> >Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
> >from all runqueues for which current thread's mm is the same as our own.
> >
> >Scheduler-wise, it requires that we add a memory barrier after context
> >switching between processes (which have different mm).
> >
> >It would be interesting to benchmark the overhead of this added barrier
> >on the performance of context switching between processes. If the
> >preexisting overhead of switching between mm is high enough, the
> >overhead of adding this extra barrier may be insignificant.
> >
> >[ Compile-tested only! ]
> >
> >CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >CC: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> >CC: Boqun Feng <boqun.feng@xxxxxxxxx>
> >Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> >---
> > include/uapi/linux/membarrier.h | 8 +++--
> > kernel/membarrier.c | 76 ++++++++++++++++++++++++++++++++++++++++-
> > kernel/sched/core.c | 21 ++++++++++++
> > 3 files changed, 102 insertions(+), 3 deletions(-)
> >
> >diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> >index e0b108bd2624..6a33c5852f6b 100644
> >--- a/include/uapi/linux/membarrier.h
> >+++ b/include/uapi/linux/membarrier.h
> >@@ -40,14 +40,18 @@
> > * (non-running threads are de facto in such a
> > * state). This covers threads from all processes
> > * running on the system. This command returns 0.
> >+ * TODO: documentation.
> > *
> > * Command to be passed to the membarrier system call. The commands need to
> > * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
> > * the value 0.
> > */
> > enum membarrier_cmd {
> >- MEMBARRIER_CMD_QUERY = 0,
> >- MEMBARRIER_CMD_SHARED = (1 << 0),
> >+ MEMBARRIER_CMD_QUERY = 0,
> >+ MEMBARRIER_CMD_SHARED = (1 << 0),
> >+ /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
> >+ /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
> >+ MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
> > };
> >
> > #endif /* _UAPI_LINUX_MEMBARRIER_H */
> >diff --git a/kernel/membarrier.c b/kernel/membarrier.c
> >index 9f9284f37f8d..8c6c0f96f617 100644
> >--- a/kernel/membarrier.c
> >+++ b/kernel/membarrier.c
> >@@ -19,10 +19,81 @@
> > #include <linux/tick.h>
> >
> > /*
> >+ * XXX For cpu_rq(). Should we rather move
> >+ * membarrier_private_expedited() to sched/core.c or create
> >+ * sched/membarrier.c ?
> >+ */
> >+#include "sched/sched.h"
> >+
> >+/*
> > * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> > * except MEMBARRIER_CMD_QUERY.
> > */
> >-#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
> >+#define MEMBARRIER_CMD_BITMASK \
> >+ (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
> >+
>
> > rcu_read_unlock();
> >+ }
> >+}
> >+
> >+static void membarrier_private_expedited(void)
> >+{
> >+ int cpu, this_cpu;
> >+ cpumask_var_t tmpmask;
> >+
> >+ if (num_online_cpus() == 1)
> >+ return;
> >+
> >+ /*
> >+ * Matches memory barriers around rq->curr modification in
> >+ * scheduler.
> >+ */
> >+ smp_mb(); /* system call entry is not a mb. */
> >+
> >+ if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> >+ /* Fallback for OOM. */
> >+ membarrier_private_expedited_ipi_each();
> >+ goto end;
> >+ }
> >+
> >+ this_cpu = raw_smp_processor_id();
> >+ for_each_online_cpu(cpu) {
> >+ struct task_struct *p;
> >+
> >+ if (cpu == this_cpu)
> >+ continue;
> >+ rcu_read_lock();
> >+ p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> >+ if (p && p->mm == current->mm)
> >+ __cpumask_set_cpu(cpu, tmpmask);
>
> This gets you some false positives, if the CPU idled then mm will
> not have changed.

Good point! The battery-powered embedded guys would probably prefer
we not needlessly IPI idle CPUs. We cannot rely on RCU's dyntick-idle
state in nohz_full cases. Not sure if is_idle_task() can be used
safely, given things like play_idle().
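
To make the idle-CPU concern concrete, here is a rough userspace model
of the target selection, not kernel code. The names struct cpu_state,
its idle flag, and pick_ipi_targets() are all hypothetical; the idle
flag stands in for whatever idle test turns out to be safe, which is
exactly the open question above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the per-CPU state that the patch reads
 * through cpu_rq(cpu)->curr. Nothing here is a real kernel structure.
 */
struct cpu_state {
    const void *curr_mm; /* mm seen for the CPU's current task, or NULL */
    bool idle;           /* assumed flag: the CPU is running its idle task */
};

/*
 * Return a bitmask of the CPUs that would receive an IPI: every CPU
 * other than the caller's whose current mm matches the caller's mm,
 * skipping CPUs flagged idle so they are not woken needlessly.
 */
uint64_t pick_ipi_targets(const struct cpu_state *cpus, size_t ncpus,
                          size_t this_cpu, const void *caller_mm)
{
    uint64_t mask = 0;

    assert(ncpus <= 64); /* this model keeps the mask in one 64-bit word */

    for (size_t cpu = 0; cpu < ncpus; cpu++) {
        if (cpu == this_cpu)
            continue; /* the caller issues its own barriers */
        if (cpus[cpu].idle)
            continue; /* the false positive pointed out above */
        if (cpus[cpu].curr_mm != NULL && cpus[cpu].curr_mm == caller_mm)
            mask |= UINT64_C(1) << cpu;
    }
    return mask;
}
```

With four CPUs where CPU 0 is the caller, CPUs 1 and 2 show the caller's
mm, CPU 2 is idle, and CPU 3 runs another process, only CPU 1 ends up
in the mask.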

> >+ rcu_read_unlock();
> >+ }
> >+ smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
> >+ free_cpumask_var(tmpmask);
> >+end:
> >+ /*
> >+ * Memory barrier on the caller thread _after_ we finished
> >+ * waiting for the last IPI. Matches memory barriers around
> >+ * rq->curr modification in scheduler.
> >+ */
> >+ smp_mb(); /* exit from system call is not a mb */
> >+}
> >
> > /**
> > * sys_membarrier - issue memory barriers on a set of threads
> >@@ -64,6 +135,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> > if (num_online_cpus() > 1)
> > synchronize_sched();
> > return 0;
> >+ case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> >+ membarrier_private_expedited();
> >+ return 0;
> > default:
> > return -EINVAL;
> > }
> >diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >index 17c667b427b4..f171d2aaaf82 100644
> >--- a/kernel/sched/core.c
> >+++ b/kernel/sched/core.c
> >@@ -2724,6 +2724,26 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
> > put_user(task_pid_vnr(current), current->set_child_tid);
> > }
> >
> >+#ifdef CONFIG_MEMBARRIER
> >+static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> >+ struct mm_struct *oldmm)
> >+{
> >+ if (likely(mm == oldmm))
> >+ return; /* Thread context switch, same mm. */
> >+ /*
> >+ * When switching between processes, membarrier expedited
> >+ * private requires a memory barrier after we set the current
> >+ * task.
> >+ */
> >+ smp_mb();
> >+}
>
> Won't the actual page table switch generate a barrier, at least on
> many archs? It sure will on x86.

There are apparently at least a few architectures that don't.
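
For reference, the shape of the hook can be sketched in userspace with
a C11 fence standing in for smp_mb(). This is only an analogy under
that assumption: barrier_after_switch() is a hypothetical name, and a
C11 seq_cst fence is not claimed to be equivalent to the kernel
primitive on any given architecture.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace analogy of the hook in the patch: issue a full fence only
 * when the "context switch" crosses address spaces. Returns true if a
 * fence was issued, so callers can observe which path was taken.
 */
bool barrier_after_switch(const void *mm, const void *oldmm)
{
    if (mm == oldmm)
        return false; /* thread switch within one process: no fence */

    /* Stand-in for smp_mb() after the current task has been set. */
    atomic_thread_fence(memory_order_seq_cst);
    return true;
}
```

On an architecture where the page-table switch already implies a full
barrier, the fence in the second path would be redundant, which is the
optimization being asked about.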

> It's also unneeded if kernel entry or exit involve a barrier (not
> true for x86, so probably not for anything else either).
>
> >+#else /* #ifdef CONFIG_MEMBARRIER */
> >+static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> >+ struct mm_struct *oldmm)
> >+{
> >+}
> >+#endif /* #else #ifdef CONFIG_MEMBARRIER */
> >+
> > /*
> > * context_switch - switch to the new MM and the new thread's register state.
> > */
> >@@ -2737,6 +2757,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >
> > mm = next->mm;
> > oldmm = prev->active_mm;
> >+ membarrier_expedited_mb_after_set_current(mm, oldmm);
> > /*
> > * For paravirt, this is coupled with an exit in switch_to to
> > * combine the page table reload and the switch backend into
>
>