[RFC PATCH] percpu system call: fast userspace percpu critical sections

From: Mathieu Desnoyers
Date: Thu May 21 2015 - 10:45:27 EST

Expose a new system call allowing userspace threads to register
a TLS area used as an ABI between the kernel and userspace to
share information required to create efficient per-cpu critical
sections in user-space.

This ABI consists of a thread-local structure containing:

- a nesting count surrounding the critical section,
- a signal number to be sent to the thread when preempting a thread
with non-zero nesting count,
- a flag indicating whether the signal has been sent within the
critical section,
- an integer in which the kernel stores the current CPU number,
updated each time the thread is scheduled back in after a
preemption. This CPU number cache is not strictly needed, but
performs better than the getcpu vDSO.
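
As an illustration of the list above, here is a minimal user-space
sketch of the shared area. The field names and order are taken from
struct thread_percpu_user in kernel/percpu-user.c below; the helpers
percpu_area_init() and percpu_cached_cpu() are hypothetical names used
only for this example. Registration itself would be a raw
syscall(__NR_percpu, area), which is only meaningful on a kernel
carrying this patch, so the sketch deliberately does not issue it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors struct thread_percpu_user from kernel/percpu-user.c in this patch. */
struct thread_percpu_user {
	int32_t nesting;	/* critical section nesting count */
	int32_t signal_sent;	/* set by the kernel once the signal is sent */
	int32_t signo;		/* signal to deliver on preemption */
	int32_t current_cpu;	/* CPU number cache written by the kernel */
};

/* Layout checks: four consecutive 32-bit fields, as in the patch. */
static_assert(sizeof(struct thread_percpu_user) == 16, "unexpected size");
static_assert(offsetof(struct thread_percpu_user, current_cpu) == 12,
	      "unexpected offset");

/* One instance per thread (C11 thread-local storage). */
static _Thread_local struct thread_percpu_user tpu_area;

/*
 * Hypothetical helper: prepare the calling thread's area. On a patched
 * kernel the returned pointer would then be passed to the percpu system
 * call; that call is intentionally left out of this sketch.
 */
static struct thread_percpu_user *percpu_area_init(int signo)
{
	tpu_area.nesting = 0;
	tpu_area.signal_sent = 0;
	tpu_area.signo = signo;
	tpu_area.current_cpu = -1;	/* not yet filled in by the kernel */
	return &tpu_area;
}

/* Hypothetical helper: read the cached CPU number with a single load. */
static int32_t percpu_cached_cpu(const struct thread_percpu_user *tpu)
{
	return *(const volatile int32_t *)&tpu->current_cpu;
}
```

Reading the CPU number is then a plain TLS load, which is why it can
be cheaper than a call through the vDSO.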

This approach is inspired by Paul Turner and Andrew Hunter's work
on percpu atomics, which lets the kernel handle restart of critical
sections, ref. http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

What is done differently here compared to percpu atomics: we track
a single nesting counter per thread rather than many ranges of
instruction pointer values. We deliver a signal to user-space and
let the logic of restart be handled in user-space, thus moving
the complexity out of the kernel. The nesting counter approach
allows us to skip the complexity of interacting with signals that
would be otherwise needed with the percpu atomics approach, which
needs to know which instruction pointers are preempted, including
when preemption occurs on a signal handler nested over an instruction
pointer of interest.
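
To make the user-space restart idea concrete, here is a minimal
sketch. It is not the scheme used by the proof of concept (which
relies on page protection and a SIGSEGV handler); it swaps in
sigsetjmp()/siglongjmp() because that is easier to show in a few
lines. The kernel is not involved: raise() stands in for the signal
the kernel would send on preemption, and all names (tpu_model,
restartable_increment, ...) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced model of the shared area: only the fields used here. */
struct tpu_model {
	volatile int32_t nesting;
	volatile int32_t signal_sent;
};

static struct tpu_model tpu;
static sigjmp_buf restart_point;
static volatile sig_atomic_t restarts;

/* Restart handler: abort the critical section and jump to its start. */
static void restart_handler(int signo)
{
	(void)signo;
	if (tpu.nesting == 0)
		return;		/* not in a critical section */
	restarts++;
	siglongjmp(restart_point, 1);
}

static int install_restart_handler(int signo)
{
	struct sigaction sa;

	sa.sa_handler = restart_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	return sigaction(signo, &sa, NULL);
}

/*
 * Load/store increment of *counter inside a restartable section.
 * If simulate_preemption is non-zero, the signal is raised once
 * between the load and the store.
 */
static long restartable_increment(long *counter, int signo,
				  int simulate_preemption)
{
	volatile int inject = simulate_preemption;
	long snapshot;

	assert(counter != NULL);
	/* Save the signal mask so the signal is unblocked after the jump. */
	(void)sigsetjmp(restart_point, 1);
	tpu.signal_sent = 0;
	tpu.nesting = 1;		/* enter critical section */
	snapshot = *counter;		/* load */
	if (inject) {
		inject = 0;
		tpu.signal_sent = 1;	/* what the kernel would set */
		raise(signo);		/* does not return: handler jumps back */
	}
	*counter = snapshot + 1;	/* store (commit) */
	tpu.nesting = 0;		/* leave critical section */
	return *counter;
}
```

When the signal arrives inside the section, the stale snapshot is
discarded and the load is redone, so the increment is applied exactly
once.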

Advantages of this approach over percpu atomics:
- kernel code is relatively simple: complexity of restart sections
is in user-space,
- easy to port to other architectures: just need to reserve a new
system call,
- for threads which have registered a TLS structure, the fast-path
at preemption is only a nesting counter check, along with the
optional store of the current CPU number, rather than comparing
instruction pointer with possibly many registered ranges.
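
The preemption fast path described in the last item can be modelled
in plain C as follows. This is a user-space model of the decision
taken by percpu_user_sched_in() in the patch below, not kernel code:
model_sched_in() is a hypothetical name, and its return value stands
in for the call to do_send_sig_info().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as struct thread_percpu_user in the patch. */
struct thread_percpu_user {
	int32_t nesting;
	int32_t signal_sent;
	int32_t signo;
	int32_t current_cpu;
};

/*
 * Model of the check done when a registered thread is scheduled in.
 * Returns true when the restart signal would be sent.
 */
static bool model_sched_in(struct thread_percpu_user *tpu, int32_t cpu)
{
	assert(tpu != NULL);
	tpu->current_cpu = cpu;		/* refresh the CPU number cache */
	if (!tpu->nesting || tpu->signal_sent)
		return false;		/* outside a section, or already signalled */
	tpu->signal_sent = 1;		/* at most one signal per section */
	return true;
}
```

The cost is independent of how many critical sections the
application defines, since only the nesting counter and the
signal_sent flag are examined.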

Caveats of this approach compared to the percpu atomics:
- We need a signal number for this, so it cannot be done without
designing the application accordingly,
- Handling restart in user-space is currently performed with page
protection, for which we install a SIGSEGV signal handler. Again,
this requires designing the application accordingly, especially
if the application installs its own segmentation fault handler,
- It cannot be used for tracing of processes by injection of code
into their address space, due to interactions with application
signal handlers.

The user-space proof of concept code implementing the restart section
can be found here: https://github.com/compudj/percpu-dev

Benchmark of sched_getcpu() versus the TLS cache approach for
getting the current CPU number:

- With Linux vdso: 12.7 ns
- With TLS-cached cpu number: 0.3 ns

We will use the TLS-cached cpu number for the following
benchmarks.

On an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, we compare a
baseline running very few loads/stores (no locking, no getcpu,
assuming one thread per CPU with affinity) against locking
schemes based on "lock; cmpxchg", "cmpxchg" (using restart
signal), and load-store (using restart signal). This is
performed with 32 threads on a 16-core, hyperthreaded system:

                 ns/loop    overhead (ns)
Baseline:           3.7          0.0
lock; cmpxchg:     22.0         18.3
cmpxchg:           11.1          7.4
load-store:         9.4          5.7

Therefore, in terms of overhead over the baseline, the load-store
scheme has a speedup of 3.2x over the "lock; cmpxchg" scheme
(18.3 ns vs 5.7 ns) if both are using the tls-cache for the CPU
number. If we use Linux sched_getcpu() for "lock; cmpxchg", we
reach a speedup of 5.4x for load-store+tls-cache vs
"lock; cmpxchg"+vdso-getcpu.
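
The arithmetic behind these two ratios can be checked with the short
sketch below. The 3.2x figure follows directly from the overhead
column. For the 5.4x figure, the sketch assumes that switching
"lock; cmpxchg" to the vdso adds the difference between the two
getcpu costs (12.7 ns - 0.3 ns) to its overhead; this reading
reproduces the quoted number but is an assumption, as are the
function names.

```c
#include <assert.h>

/* Measurements quoted above, in nanoseconds. */
static const double baseline_ns = 3.7;
static const double lock_cmpxchg_ns = 22.0;
static const double load_store_ns = 9.4;
static const double vdso_getcpu_ns = 12.7;
static const double tls_getcpu_ns = 0.3;

/* Overhead of a scheme relative to the baseline loop. */
static double overhead_ns(double ns_per_loop)
{
	return ns_per_loop - baseline_ns;
}

/* Both schemes use the TLS-cached CPU number. */
static double speedup_tls_cache(void)
{
	double ls = overhead_ns(load_store_ns);

	assert(ls > 0.0);
	return overhead_ns(lock_cmpxchg_ns) / ls;
}

/* "lock; cmpxchg" pays for the vdso getcpu instead of the TLS load. */
static double speedup_vs_vdso_getcpu(void)
{
	double ls = overhead_ns(load_store_ns);
	double lc = overhead_ns(lock_cmpxchg_ns) +
		    (vdso_getcpu_ns - tls_getcpu_ns);

	assert(ls > 0.0);
	return lc / ls;
}
```

With the values above, the first ratio is 18.3 / 5.7 and the second
is 30.7 / 5.7.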

I'm sending this out to trigger discussion, and hopefully to see
Paul and Andrew's patches being posted publicly at some point, so
we can compare our approaches.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
CC: Paul Turner <pjt@xxxxxxxxxx>
CC: Andrew Hunter <ahh@xxxxxxxxxx>
CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
CC: Ingo Molnar <mingo@xxxxxxxxxx>
CC: Ben Maurer <bmaurer@xxxxxx>
CC: Steven Rostedt <rostedt@xxxxxxxxxxx>
CC: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
CC: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
CC: Lai Jiangshan <laijs@xxxxxxxxxxxxxx>
CC: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---
arch/x86/syscalls/syscall_64.tbl | 1 +
fs/exec.c | 1 +
include/linux/sched.h | 18 ++++++
include/uapi/asm-generic/unistd.h | 4 +-
init/Kconfig | 10 +++
kernel/Makefile | 1 +
kernel/fork.c | 2 +
kernel/percpu-user.c | 126 ++++++++++++++++++++++++++++++++++++++
kernel/sys_ni.c | 3 +
9 files changed, 165 insertions(+), 1 deletion(-)
create mode 100644 kernel/percpu-user.c

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..0499703 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 common percpu sys_percpu

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c
index c7f9b73..0a2f0b2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1555,6 +1555,7 @@ static int do_execveat_common(int fd, struct filename *filename,
/* execve succeeded */
current->fs->in_exec = 0;
current->in_execve = 0;
+ percpu_user_execve(current);
acct_update_integrals(current);
task_numa_free(current);
free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a419b65..9c88bff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,6 +1275,8 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};

+struct thread_percpu_user;
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1710,6 +1712,10 @@ struct task_struct {
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long task_state_change;
#endif
+#ifdef CONFIG_PERCPU_USER
+ struct preempt_notifier percpu_user_notifier;
+ struct thread_percpu_user __user *percpu_user;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
@@ -3090,4 +3096,16 @@ static inline unsigned long rlimit_max(unsigned int limit)
return task_rlimit_max(current, limit);
}

+#ifdef CONFIG_PERCPU_USER
+void percpu_user_fork(struct task_struct *t);
+void percpu_user_execve(struct task_struct *t);
+#else
+static inline void percpu_user_fork(struct task_struct *t)
+{
+}
+static inline void percpu_user_execve(struct task_struct *t)
+{
+}
+#endif
+
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..f4350d9 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)
#define __NR_execveat 281
__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_percpu 282
+__SYSCALL(__NR_percpu, sys_percpu)

#undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283

/*
* All syscalls below here should go away really,
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..73c4070 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1559,6 +1559,16 @@ config PCI_QUIRKS
bugs/quirks. Disable this only if your target machine is
unaffected by PCI quirks.

+config PERCPU_USER
+ bool "Enable percpu() system call" if EXPERT
+ default y
+ select PREEMPT_NOTIFIERS
+ help
+ Enable the percpu() system call which provides a building block
+ for fast per-cpu critical sections in user-space.
+
+ If unsure, say Y.
+
config EMBEDDED
bool "Embedded system"
option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..76919a6 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_PERCPU_USER) += percpu-user.o

$(obj)/configs.o: $(obj)/config_data.h

diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..63aaf5a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1549,6 +1549,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
cgroup_post_fork(p);
if (clone_flags & CLONE_THREAD)
threadgroup_change_end(current);
+ if (!(clone_flags & CLONE_THREAD))
+ percpu_user_fork(p);
perf_event_fork(p);

trace_task_newtask(p, clone_flags);
diff --git a/kernel/percpu-user.c b/kernel/percpu-user.c
new file mode 100644
index 0000000..be3d439
--- /dev/null
+++ b/kernel/percpu-user.c
@@ -0,0 +1,126 @@
+/*
+ * Copyright (C) 2015 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
+ *
+ * percpu system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/preempt.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+
+struct thread_percpu_user {
+	int32_t nesting;
+	int32_t signal_sent;
+	int32_t signo;
+	int32_t current_cpu;
+};
+
+static void percpu_user_sched_in(struct preempt_notifier *notifier, int cpu)
+{
+	struct thread_percpu_user __user *tpu_user;
+	struct thread_percpu_user tpu;
+	struct task_struct *t = current;
+
+	tpu_user = t->percpu_user;
+	if (tpu_user == NULL)
+		return;
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	/*
+	 * access_ok() of tpu_user has already been checked by sys_percpu().
+	 */
+	if (__put_user(smp_processor_id(), &tpu_user->current_cpu)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+	if (__copy_from_user(&tpu, tpu_user, sizeof(tpu))) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+	if (!tpu.nesting || tpu.signal_sent)
+		return;
+	if (do_send_sig_info(tpu.signo, SEND_SIG_PRIV, t, 0)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+	tpu.signal_sent = 1;
+	if (__copy_to_user(tpu_user, &tpu, sizeof(tpu))) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+}
+
+static void percpu_user_sched_out(struct preempt_notifier *notifier,
+		struct task_struct *next)
+{
+}
+
+static struct preempt_ops percpu_user_ops = {
+	.sched_in = percpu_user_sched_in,
+	.sched_out = percpu_user_sched_out,
+};
+
+/*
+ * If parent had a percpu-user preempt notifier, we need to setup our own.
+ */
+void percpu_user_fork(struct task_struct *t)
+{
+	struct task_struct *parent = current;
+
+	if (!parent->percpu_user)
+		return;
+	preempt_notifier_init(&t->percpu_user_notifier, &percpu_user_ops);
+	preempt_notifier_register(&t->percpu_user_notifier);
+	t->percpu_user = parent->percpu_user;
+}
+
+void percpu_user_execve(struct task_struct *t)
+{
+	if (!t->percpu_user)
+		return;
+	preempt_notifier_unregister(&t->percpu_user_notifier);
+	t->percpu_user = NULL;
+}
+
+/*
+ * sys_percpu - setup user-space per-cpu critical section for caller thread
+ */
+SYSCALL_DEFINE1(percpu, struct thread_percpu_user __user *, tpu)
+{
+	struct task_struct *t = current;
+
+	if (tpu == NULL) {
+		if (t->percpu_user)
+			preempt_notifier_unregister(&t->percpu_user_notifier);
+		goto set_tpu;
+	}
+	if (!access_ok(VERIFY_WRITE, tpu, sizeof(struct thread_percpu_user)))
+		return -EFAULT;
+	preempt_disable();
+	if (__put_user(smp_processor_id(), &tpu->current_cpu)) {
+		WARN_ON_ONCE(1);
+		preempt_enable();
+		return -EFAULT;
+	}
+	preempt_enable();
+	if (!current->percpu_user) {
+		preempt_notifier_init(&t->percpu_user_notifier,
+				&percpu_user_ops);
+		preempt_notifier_register(&t->percpu_user_notifier);
+	}
+set_tpu:
+	current->percpu_user = tpu;
+	return 0;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..16e2bc8 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -229,3 +229,6 @@ cond_syscall(sys_bpf);

/* execveat */
cond_syscall(sys_execveat);
+
+/* percpu userspace critical sections */
+cond_syscall(sys_percpu);
--
2.1.4
