Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy

From: Mathieu Desnoyers
Date: Sun Oct 04 2009 - 17:13:41 EST


* Paul E. McKenney (paulmck@xxxxxxxxxxxxxxxxxx) wrote:
> On Sun, Oct 04, 2009 at 10:37:45AM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@xxxxxxxxxxxxxxxxxx) wrote:
> > > On Wed, Sep 23, 2009 at 01:48:20PM -0400, Mathieu Desnoyers wrote:
> > > > Hi,
> > > >
> > > > When implementing the call_rcu() "worker thread" in userspace, I ran
> > > > into the problem that it had to be woken up periodically to check if
> > > > there are any callbacks to execute. However, I easily imagine that this
> > > > does not fit well with the "green computing" definition.
> > > >
> > > > Therefore, I've looked at ways to have the call_rcu() callers waking up
> > > > this worker thread when callbacks are enqueued. However, I don't want to
> > > > take any lock and the fast path (when no wake up is required) should not
> > > > cause any cache-line exchange.
> > > >
> > > > Here are the primitives I've created. I'd like to have feedback on my
> > > > futex use, just to make sure I did not make any incorrect assumptions.
> > > >
> > > > This could also be eventually used in the QSBR Userspace RCU quiescent
> > > > state and in mb/signal userspace RCU when exiting RCU read-side C.S. to
> > > > ensure synchronize_rcu() does not busy-wait for too long.
> > > >
> > > > /*
> > > >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> > > >  */
> > > > static void wake_up_defer(void)
> > > > {
> > > > 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> > > > 		atomic_set(&defer_thread_futex, 0);
> > > > 	futex(&defer_thread_futex, FUTEX_WAKE,
> > > > 	      0, NULL, NULL, 0);
> > > > }
> > > >
> > > > /*
> > > >  * Defer thread waiting. Single thread.
> > > >  */
> > > > static void wait_defer(void)
> > > > {
> > > > 	atomic_dec(&defer_thread_futex);
> > > > 	if (atomic_read(&defer_thread_futex) == -1)
> > > > 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > > > 		      NULL, NULL, 0);
> > > > }
> > >
> > > The standard approach would be to use pthread_cond_wait() and
> > > pthread_cond_broadcast(). Unfortunately, this would require holding a
> > > pthread_mutex_lock across both operations, which would not necessarily
> > > be so good for wake-up-side scalability.
> >
> > The pthread_cond_broadcast() mutex is really a bugger when it comes to
> > executing it at each rcu_read_unlock(). We might as well use a mutex to
> > protect the whole read-side.. :-(
> >
> > > That said, without this sort of heavy-locking approach, wakeup races
> > > are quite difficult to avoid.
> >
> > I did a formal model of my futex-based wait/wakeup. The main idea is
> > that the waiter:
> >
> > - Sets itself to "waiting"
> > - Checks the "real condition" for which it will wait (e.g. queues empty
> > when used for rcu callbacks, no more ongoing old reader thread C.S.
> > when used in synchronize_rcu())
> > - Calls sys_futex if the variable has not changed.
> >
> > And the waker:
> > - Sets the "real condition" waking up the waiter (enqueuing, or
> > rcu_read_unlock())
> > - Checks whether the waiter must be woken up; if so, wakes it up by
> > setting the state to "running" and calling sys_futex.
> >
> > But as you say, wakeup races are difficult (but not impossible!) to
> > avoid. This is why I resorted to a formal model of the wait/wakeup
> > scheme to ensure that we cannot end up in a situation where a waker
> > races with the waiter and does not wake it up when it should. This is
> > nothing fancy (it does not model memory and instruction reordering
> > automatically), but I figure that memory barriers are required between
> > almost every step of this algorithm, so by adding smp_mb() I end up
> > ensuring sequential behavior. I added test cases in the model to ensure
> > that incorrect memory reordering _would_ cause errors by doing the
> > reordering by hand in error-injection runs.
>
> My question is whether pthread_cond_wait() and pthread_cond_broadcast()
> can substitute for the raw call to futex. Unless I am missing something
> (which I quite possibly am), the kernel will serialize on the futex
> anyway, so serialization in user-mode code does not add much additional
> pain.

The kernel sys_futex implementation only takes per-bucket spinlocks, so
its cost is far from that of a global mutex in pthread_cond. Moreover,
my scheme does not need to take any mutex in the fast path (when there
is no waiter to wake up), which makes its performance appropriate for
use in the RCU read-side. In that case it boils down to a memory
barrier, a variable read, a test and a branch.
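
To make that fast path concrete, here is a minimal standalone sketch of
the waker side. It is only an illustration, not the urcu code: it
assumes GCC/Clang __atomic builtins in place of uatomic_*() and smp_mb(),
a raw syscall(SYS_futex, ...) in place of the futex() wrapper, and
wake_syscalls and enqueue_done() are hypothetical names I added so the
number of system calls can be observed.

```c
#include <assert.h>
#include <linux/futex.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static int defer_thread_futex;	/* 0: worker running, -1: worker waiting */
static int wake_syscalls;	/* hypothetical counter, demo only, not thread-safe */

/* glibc exports no futex() function, so go through syscall(). */
static long futex_wake_one(int *uaddr)
{
	wake_syscalls++;
	return syscall(SYS_futex, uaddr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

static void wake_up_defer(void)
{
	/* Fast path: one load, one compare, one branch; nothing is written. */
	if (__atomic_load_n(&defer_thread_futex, __ATOMIC_RELAXED) == -1) {
		/* Slow path: mark the worker as running, then wake it. */
		__atomic_store_n(&defer_thread_futex, 0, __ATOMIC_RELAXED);
		futex_wake_one(&defer_thread_futex);
	}
}

/* Caller side, after the work has been published (hypothetical name). */
static void enqueue_done(void)
{
	__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* stands in for smp_mb() */
	wake_up_defer();
}
```

With defer_thread_futex at 0, enqueue_done() never enters the kernel
and never writes to the futex word; only when the worker has announced
itself as waiting (-1) does the caller pay for the store and the
FUTEX_WAKE.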

>
> > The model is available at:
> > http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=tree;f=futex-wakeup;h=4ddeaeb2784165cb0465d4ca9f7d27acb562eae3;hb=refs/heads/formal-model
> >
> > (this is in the formal-model branch of the urcu tree, futex-wakeup
> > subdir)
> >
> > This models the following snippet of code:
> >
> > static int defer_thread_futex;
> >
> > /*
> >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> >  */
> > static void wake_up_defer(void)
> > {
> > 	if (unlikely(uatomic_read(&defer_thread_futex) == -1)) {
> > 		uatomic_set(&defer_thread_futex, 0);
> > 		futex(&defer_thread_futex, FUTEX_WAKE, 1,
> > 		      NULL, NULL, 0);
> > 	}
> > }
> >
> > static void enqueue(void *callback) /* not the actual types */
> > {
> > 	add_to_queue(callback);
> > 	smp_mb();
> > 	wake_up_defer();
> > }
> >
> > /*
> >  * rcu_defer_num_callbacks() returns the total number of callbacks
> >  * enqueued.
> >  */
> >
> > /*
> >  * Defer thread waiting. Single thread.
> >  */
> > static void wait_defer(void)
> > {
> > 	uatomic_dec(&defer_thread_futex);
> > 	smp_mb();	/* Write futex before read queue */
> > 	if (rcu_defer_num_callbacks()) {
> > 		smp_mb();	/* Read queue before write futex */
> > 		/* Callbacks are queued, don't wait. */
> > 		uatomic_set(&defer_thread_futex, 0);
> > 	} else {
> > 		smp_rmb();	/* Read queue before read futex */
> > 		if (uatomic_read(&defer_thread_futex) == -1)
> > 			futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > 			      NULL, NULL, 0);
> > 	}
> > }
> >
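
As an illustration of the waiter side of the snippet quoted above, here
is a small standalone sketch that can be compiled outside of urcu. The
assumptions are mine: GCC/Clang __atomic builtins replace uatomic_*()
and smp_mb(), queued_callbacks is a hypothetical stand-in for
rcu_defer_num_callbacks(), and futex_wait_if() and the enum are
hypothetical helpers. It relies on the documented FUTEX_WAIT behaviour
that the kernel compares the futex word with the expected value and
fails with EAGAIN instead of sleeping when they differ.

```c
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static int defer_thread_futex;	/* 0: running, -1: waiting */
static int queued_callbacks;	/* hypothetical stand-in for rcu_defer_num_callbacks() */

enum wait_result { SAW_CALLBACKS, LOST_RACE, WAS_WOKEN };

/* Sleep only if *uaddr still equals 'expected'; the kernel checks this atomically. */
static long futex_wait_if(int *uaddr, int expected)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static enum wait_result wait_defer(void)
{
	/* Step 1: announce "waiting" (0 -> -1). */
	__atomic_fetch_sub(&defer_thread_futex, 1, __ATOMIC_SEQ_CST);
	__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* write futex before read queue */

	/* Step 2: re-check the real condition. */
	if (__atomic_load_n(&queued_callbacks, __ATOMIC_SEQ_CST)) {
		__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* read queue before write futex */
		__atomic_store_n(&defer_thread_futex, 0, __ATOMIC_SEQ_CST);
		return SAW_CALLBACKS;
	}

	/* Step 3: sleep unless a waker already moved the word away from -1. */
	if (futex_wait_if(&defer_thread_futex, -1) == -1 && errno == EAGAIN)
		return LOST_RACE;	/* waker won the race: no sleep, no lost wakeup */
	return WAS_WOKEN;
}
```

The LOST_RACE case is the one that matters for wakeup races: if a waker
resets the word to 0 between the waiter's decrement and its FUTEX_WAIT,
the system call returns immediately rather than blocking on a wakeup
that has already happened. This sketch skips the extra uatomic_read()
before the system call, since the kernel performs the same comparison.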
> >
> > Comments are welcome,
>
> I will take a look after further recovery from jetlag. Not yet competent
> to review this kind of stuff. Give me a few days. ;-)

No problem, thanks for looking at this,

Mathieu

>
> Thanx, Paul

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68