Re: [PATCH v17 1/2] sys_membarrier(): system-wide memory barrier (generic, x86)

From: Mathieu Desnoyers
Date: Tue May 05 2015 - 14:25:18 EST


----- Original Message -----
> On Mon, May 04, 2015 at 05:00:12PM -0400, Mathieu Desnoyers wrote:
> > * Benchmarks
> >
> > On Intel Xeon E5405 (8 cores)
> > (one thread is calling sys_membarrier, the other 7 threads are busy
> > looping)
> >
> > 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
> >
> > * User-space user of this system call: Userspace RCU library
> >
> > Both the signal-based and the sys_membarrier userspace RCU schemes
> > permit us to remove the memory barrier from the userspace RCU
> > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > accelerating them. These memory barriers are replaced by compiler
> > barriers on the read-side, and all matching memory barriers on the
> > write-side are turned into an invocation of a memory barrier on all
> > active threads in the process. By letting the kernel perform this
> > synchronization rather than dumbly sending a signal to every thread of
> > the process (as we currently do), we diminish the number of unnecessary
> > wake-ups and only issue the memory barriers on active threads. Non-running
> > threads do not need to execute such a barrier anyway, because these are
> > implied by the scheduler context switches.
> >
> > Results in liburcu:
> >
> > Operations in 10s, 6 readers, 2 writers:
> >
> > memory barriers in reader: 1701557485 reads, 3129842 writes
> > signal-based scheme: 9825306874 reads, 5386 writes
> > sys_membarrier: 7992076602 reads, 220 writes
> >
> > The dynamic sys_membarrier availability check adds some overhead to
> > the read-side compared to the signal-based scheme, but besides that,
> > with the expedited scheme, we can see that we are close to the read-side
> > performance of the signal-based scheme. However, this non-expedited
> > sys_membarrier implementation has a much slower grace period than signal
> > and memory barrier schemes.
> >
> > An expedited version of this system call can be added later on to speed
> > up the grace period. Its implementation will likely depend on reading
> > the cpu_curr()->mm without holding each CPU's rq lock.
>
> So, I realize that there's a lot of history tied up in the previous 16
> versions and associated mail threads. However, can you please summarize
> in the commit message what the benefit of merging this version is?
> Because from the text above, from liburcu's perspective, it appears to
> be strictly worse in performance than the signal-based scheme.
>
> There are other non-performance reasons why it might make sense to
> include this; for instance, signals don't play nice with libraries, with
> other processes you might inject yourself into for tracing purposes, or
> with general sanity. However, the explanation for those use cases and
> how membarrier() improves them needs to go in the commit message, rather
> than only in the collective memory and mail archives of people who have
> discussed this patch series.
>
> (My apologies if the explanation is in the commit message and
> I've just missed it.)

I will add info about signals vs libraries, which appears to be missing
from the commit message:

"Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application."

The commit message already points out that sys_membarrier diminishes the
number of unnecessary wake-ups sent to other threads compared to the
signal-based approach.

I re-ran those tests on urcu master branch with a slightly modified
version of the sys_membarrier scheme too: a version which assumes that
sys_membarrier is always available. We can then compare apples to
apples performance-wise between signal and membarrier approaches:

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 3129842 writes
signal-based scheme:          9830061167 reads,    6700 writes
sys_membarrier:               9952759104 reads,     425 writes
sys_membarrier (dyn. check):  7970328887 reads,     425 writes
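
To make the comparison in the table above easier to read, the read counts
can be normalized against the signal-based scheme. The snippet below is only
an illustrative sketch: the constants are copied from the table, and the
helper names (relative_to_signal, print_relative_throughput) are made up
for this example and are not part of liburcu.

```c
#include <stdio.h>

/* Read counts copied from the table above (operations in 10s). */
static const double reads_signal     = 9830061167.0;
static const double reads_membarrier = 9952759104.0;
static const double reads_dyn_check  = 7970328887.0;

/* Hypothetical helper: ratio of a scheme's reads to the signal baseline. */
static double relative_to_signal(double reads)
{
	return reads / reads_signal;
}

/* Hypothetical helper: print both sys_membarrier variants vs. signals. */
static void print_relative_throughput(void)
{
	printf("sys_membarrier:              %.4f\n",
	       relative_to_signal(reads_membarrier));
	printf("sys_membarrier (dyn. check): %.4f\n",
	       relative_to_signal(reads_dyn_check));
}
```

With these numbers, the always-available variant comes out roughly 1.2%
above the signal-based read rate, and the dynamic-check variant roughly 19%
below it.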

It shows that sys_membarrier read-side actually performs slightly
better than the signal-based scheme, in the absence of dynamic
check for syscall availability. This could be enhanced in userspace
eventually if we decide to implement self-modifying code upon
feature detection in liburcu. I'll update the commit message with
this new table.
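
For readers unfamiliar with where the dynamic-check overhead comes from,
here is a minimal sketch of the two read-side variants being compared. It is
not the liburcu source: the flag and function names (has_sys_membarrier,
reader_barrier_dynamic, reader_barrier_static) are hypothetical, and C11
fences stand in for the library's own barrier macros. The assumption is that
the flag is set once at library initialization after probing the system call.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag: set once at init if the syscall was detected. */
static bool has_sys_membarrier;

/* Compiler-only barrier: constrains compiler reordering, emits no fence. */
static inline void compiler_barrier(void)
{
	atomic_signal_fence(memory_order_seq_cst);
}

/*
 * Read-side barrier with a dynamic availability check. The load and
 * branch on every call are the extra read-side cost discussed above.
 */
static inline void reader_barrier_dynamic(void)
{
	if (has_sys_membarrier)
		compiler_barrier();	/* writer's membarrier call pairs with this */
	else
		atomic_thread_fence(memory_order_seq_cst);	/* full fence fallback */
}

/*
 * Variant that assumes the syscall is always available: no flag load,
 * no branch, only a compiler barrier.
 */
static inline void reader_barrier_static(void)
{
	compiler_barrier();
}
```

The second variant corresponds to the "sys_membarrier" row of the table, the
first to the "dyn. check" row. Patching the branch away after feature
detection would, under these assumptions, turn the first into the second.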

Thanks!

Mathieu

>
> - Josh Triplett
>

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com