Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy

From: Mathieu Desnoyers
Date: Mon Oct 05 2009 - 09:24:03 EST


(added some CC's, given the length of the answer. Thanks for asking) ;)
(Sorry for duplicate, messed up LKML email in original post)

* Michael Schnell (mschnell@xxxxxxxxx) wrote:
> I still don't understand what the advantage of FUTEX is, if the thread
> to be waked is always blocked, and thus the fast path is not in use.
>
> -Michael

Hrm, your assumption about the common case does not seem to fit my
scenarios. Typically, the wakeup will find the waiter not blocked, and
thus skip the call to sys_futex. Here is why.

I use this scheme in two different implementations:

1) in call_rcu(), to wake up the worker thread after adding work to the
queue.

This worker thread, when woken up, sleeps for a few milliseconds before
starting to dequeue work. Therefore, if the system is relatively busy,
call_rcu() will usually see the worker thread while it's sleeping (and
therefore _not_ waiting on the futex). Also, if work is enqueued while
the worker thread is executing previously queued RCU callbacks, the worker
thread will detect the new work and won't wait on the futex.

Therefore, by design, it is very unlikely for call_rcu() to actually end
up calling sys_futex.
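
For illustration, the call_rcu() side boils down to the following pattern
(a rough sketch only: enqueue(), rcu_cb_queue and call_rcu_futex are
made-up placeholder names, and it reuses the same uatomic_*/smp_mb()/futex()
primitives as the snippets further down):

static void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *))
{
    /* Add the callback to the worker's queue (placeholder helper). */
    enqueue(&rcu_cb_queue, head, func);
    /* Order the enqueue before reading the worker's futex state. */
    smp_mb();
    if (unlikely(uatomic_read(&call_rcu_futex) == -1)) {
        /* The worker announced it is about to block: wake it up. */
        uatomic_set(&call_rcu_futex, 0);
        futex(&call_rcu_futex, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
    /* Common case: worker is sleeping or still busy, no sys_futex call. */
}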

2) in rcu_read_unlock(), to wake up a synchronize_rcu() that is waiting
for past readers' grace periods to complete.

Here, synchronize_rcu() busy-waits for the readers' grace period to end;
if that does not succeed after a few attempts, it adapts (much like
adaptative pthread mutexes) and blocks with sys_futex. The waker therefore
only needs to call sys_futex when a synchronize_rcu() is currently running
and has already had to call sys_futex itself because its active waiting
attempts failed.
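
In other words, the update side does adaptative busy-waiting before it
resorts to the futex. A minimal sketch of that loop (illustrative only:
the attempt threshold, cpu_relax() and the function name are stand-ins,
while num_old_reader_gp() and wait_gp() are the helpers shown further down):

#define ACTIVE_ATTEMPTS 100     /* illustrative busy-wait threshold */

static void wait_for_old_readers(void)
{
    int attempts = 0;

    while (num_old_reader_gp()) {
        if (++attempts >= ACTIVE_ATTEMPTS) {
            /* Active waiting failed: possibly block on the futex. */
            wait_gp();
            attempts = 0;
        } else {
            cpu_relax();        /* active busy-wait */
        }
    }
}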

As you can see, in both cases the common case, the "fast path", is to find
that nobody is waiting on the futex, and therefore to skip the sys_futex
call and take no lock at all.


Now, about the slow path, which I think is worth discussing too. Indeed,
sys_futex takes a per-address spinlock, which ends up serializing all
sys_futex operations on that address on the wakeup side. Therefore, for
wakeup designs that call sys_futex very frequently, this is really bad.

There might be ways to mitigate this problem though: changing the
sys_futex implementation to use lock-free lists might help.

One advantage of calling sys_futex without holding a userspace mutex is
that contention is then limited to the kernel's per-address spinlock,
which is held for a much shorter time than a userspace mutex that would
otherwise have to be held across the whole system call, with all its
execution overhead.

We could probably turn this sys_futex-dependent scheme into something more
portable, while keeping the same fast-path behavior, by replacing my use
of sys_futex with a pthread condition variable, e.g.:

Instead of having:


static inline void wake_up_gp(void)
{
    /* Only wake up if a waiter announced itself (gp_futex == -1). */
    if (unlikely(uatomic_read(&gp_futex) == -1)) {
        uatomic_set(&gp_futex, 0);
        futex(&gp_futex, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
}

static void wait_gp(void)
{
    uatomic_dec(&gp_futex);
    smp_mb();
    if (!num_old_reader_gp()) {
        smp_mb();
        uatomic_set(&gp_futex, 0);
        return;
    }
    /* Read reader_gp before read futex */
    smp_rmb();
    if (uatomic_read(&gp_futex) == -1)
        futex(&gp_futex, FUTEX_WAIT, -1, NULL, NULL, 0);
}
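
(For completeness: futex() above is not a glibc symbol; a thin wrapper
around the raw system call along these lines is assumed here:)

#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static inline int futex(int32_t *uaddr, int futex_op, int32_t val,
                        const struct timespec *timeout,
                        int32_t *uaddr2, int32_t val3)
{
    return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}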

We could have:

static inline void wake_up_gp(void)
{
    /* just released old reader gp */
    smp_mb();
    if (unlikely(uatomic_read(&gp_wait) == -1)) {
        uatomic_set(&gp_wait, 0);
        pthread_cond_broadcast(&rcucond);
    }
}

static void wait_gp(void)
{
    uatomic_dec(&gp_wait);
    smp_mb();
    if (!num_old_reader_gp()) {
        smp_mb();
        uatomic_set(&gp_wait, 0);
        return;
    }
    /* Read reader_gp before read gp_wait */
    smp_rmb();
    pthread_mutex_lock(&rcumutex);
    if (uatomic_read(&gp_wait) == -1)
        pthread_cond_wait(&rcucond, &rcumutex);
    pthread_mutex_unlock(&rcumutex);
}

Is that what you had in mind?

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68