Re: Futex queue_me/get_user ordering

From: Jakub Jelinek
Date: Wed Nov 17 2004 - 03:48:38 EST


On Mon, Nov 15, 2004 at 01:22:18PM +0000, Jamie Lokier wrote:
> 1. A lost wakeup.
>
> wait_A is woken, but wait_B is not, even though the second
> pthread_cond_signal is "observably" after wait_B.
>
> The operation order is observable in sense that wait_B could
> update the data structure which is protected by cond+mutex, and
> wake_Y could read that update prior to deciding to signal.
>
> _Logically_, what happens is that wait_A is woken by wake_X, but
> it does not immediately re-acquire the mutex. In this time
> window, wait_B and wake_Y both run, and then wait_A acquires the
> mutex. During this window, wait_A is able to absorb the second
> signal.
>
> It's not clear to me if POSIX requires wait_B to be signalled or
> not in this case.
>
> 2. Future lost wakeups.
>
> Future calls to pthread_cond_signal(cond) fail to wake wait_B,
> even much later, because cond's NPTL data structure is
> inconsistent. It's invariant is broken.
>
> This is a bug in NPTL and it's easy to fix. Never increment wake
> unconditionally. Instead, increment it conditionally when (a)
> FUTEX_WAKE returns 1, and also (b) when FUTEX_WAIT returns -EAGAIN.

If you think it is fixable in userland, please write at least the pseudo
code that you believe should work. We have spent quite a lot of time
on that code and don't believe this is solvable in userland.
E.g. the futex IMHO must be incremented before FUTEX_WAKE, as otherwise
the woken tasks wouldn't see the effect.

I believe the only place this is solvable in is the kernel, by ensuring
atomicity (i.e. queuing task iff curval == expected_val operation atomic
wrt. futex_wake/futex_requeue in other tasks). In the RHEL3 2.4.x backport
this is easy, as spinlock is held around the user access (the page is first
pinned, then lock taken, then value compared (but that is guaranteed to
be non-blocking) and if equal queued, then unlocked and unpinned.
In 2.6.x this is harder if the kernel cannot allow some spinlock to be
taken while doing user access, but I guess the kernel needs to cope
with this, e.g. by queueing the task early but mark it as maybe queued
only. If futex_wake sees such a bit, it would wait until futex_wait
notifies it it has decided whether that one should be queued or not.
Or something else, whatever, as long as the right semantics is ensured.

Just FYI, current pseudo code is (not mentioning cancellation stuff here,
code/data to deal with pthread_cond_destroy semantics, timedwait and
pshared condvars):

typedef struct { int lock, futex; uint64_t total_seq, wakeup_seq, woken_seq;
void *mutex; uint32_t broadcast_seq; } pthread_cond_t;
pthread_cond_signal (cond)
{
mutex_lock (lock);
if (total_seq > wakeup_seq) {
++wakeup_seq, ++futex;
futex (&futex, FUTEX_WAKE, 1);
}
mutex_unlock (lock);
}
pthread_cond_wait (cond, mtx)
{
mutex_lock (lock);
mutex_unlock (mtx->lock);
++total_seq;
++futex;
mutex = mtx;
bc_seq = broadcast_seq;
seq = wakeup_seq;
do {
val = futex;
mutex_unlock (lock);
futex (&futex, FUTEX_WAIT, val);
mutex_lock (lock);
if (bc_seq != broadcast_seq)
goto out;
} while (wakeup_seq == seq || woken_seq == wakeup_seq);
++woken_seq;
out:
mutex_unlock (lock);
mutex_lock (mtx->lock);
}
pthread_cond_broadcast (cond)
{
mutex_lock (lock);
if (total_seq > wakeup_seq) {
woken_seq = wakeup_seq = total_seq;
futex = 2 * total_seq;
++broadcast_seq;
val = futex;
mutex_unlock (lock);
if (futex (&futex, FUTEX_CMP_REQUEUE, 1, INT_MAX, &mutex->lock, val) < 0)
futex (&futex, FUTEX_WAKE, INT_MAX);
return;
}
mutex_unlock (lock);
}

Jakub
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/