Re: Futex queue_me/get_user ordering

From: Hidetoshi Seto
Date: Wed Nov 17 2004 - 20:33:55 EST


Jamie Lokier wrote:
Hidetoshi Seto wrote:

I have to apologize deeply to all for my mistake.
If my understanding is correct, this bug is "2.4 futex" (RHEL3) *SPECIFIC*!!
I had swallowed the story that the 2.6 futex has the same problem...

Wrong, 2.6 has the same behaviour!

So I realize that the 2.6 futex never behaves as:

"returns 0 if the futex was not equal to the expected value, but
the process was woken by a FUTEX_WAKE call."

Updating the manpage is now unnecessary, I think.

It is necessary.
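
Whichever way the manpage goes, a caller already has to treat a return of 0
as "possibly woken, possibly not" and re-check the futex word itself. A
minimal userspace sketch of such a caller (the syscall wrapper and the
helper name here are only for illustration):

/*----- sketch begin -----*/

#include <errno.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has no futex() wrapper, so invoke the syscall directly. */
static long futex_wait_op(uint32_t *uaddr, uint32_t expected)
{
        return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/*
 * Illustrative helper: block until *word changes away from "expected".
 * FUTEX_WAIT may return 0 after a FUTEX_WAKE even if the futex value was
 * not equal to "expected", and may fail with EAGAIN or EINTR, so the
 * caller loops and re-reads the word instead of trusting any single
 * return value.
 */
static void wait_while_equal(uint32_t *word, uint32_t expected)
{
        while (__atomic_load_n(word, __ATOMIC_ACQUIRE) == expected) {
                if (futex_wait_op(word, expected) == -1 &&
                    errno != EAGAIN && errno != EINTR)
                        break;          /* unexpected error, give up */
        }
}

/*----- sketch end -----*/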

First of all, I would appreciate it if you could read my old post:
"Kernel bug in futex_wait, cause application hang with NPTL"
http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/2044.html

If my understanding is correct, the 2.6 futex does not take any spinlocks,
but a semaphore:

286 static int futex_wake(unsigned long uaddr, int nr_wake)
:
294 down_read(&current->mm->mmap_sem);

477 static int futex_wait(unsigned long uaddr, int val, unsigned long time)
:
483 down_read(&current->mm->mmap_sem);

This semaphore prevents a waiter that is temporarily queued to check the
value from becoming the target of a wakeup.

No, because it's a read-write semaphore, and we do "down_read" on it
which is a shared lock. It does not prevent concurrent wake and wait
operations!

Aha, yes. You are right.
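
So the 2.6 flow is roughly the following (a condensed sketch, not the
literal futex.c; futex_sleep() here just stands in for the waitqueue and
schedule_timeout details). Both sides take mmap_sem only for reading, so a
wake can run concurrently with a waiter that has already done queue_me()
but is still checking the value; it is the result of unqueue_me(), not
mmap_sem, that decides whether the waiter consumed a wakeup:

/*----- sketch begin -----*/

static int futex_wake_sketch(unsigned long uaddr, int nr_wake)
{
        int woken = 0;

        down_read(&current->mm->mmap_sem);  /* shared (read) lock...        */
        /* hash uaddr, walk the queue, wake and dequeue up to nr_wake waiters */
        up_read(&current->mm->mmap_sem);

        return woken;
}

static int futex_wait_sketch(unsigned long uaddr, int val, unsigned long time)
{
        struct futex_q q;
        int curval, ret = 0;

        down_read(&current->mm->mmap_sem);  /* ...so this does NOT exclude the waker */
        queue_me(&q, -1, NULL);             /* q is visible to futex_wake() from here */

        if (get_user(curval, (int __user *)uaddr))
                ret = -EFAULT;
        else if (curval != val)
                ret = -EWOULDBLOCK;
        up_read(&current->mm->mmap_sem);

        if (!ret)
                ret = futex_sleep(&q, time); /* wait for wakeup, signal or timeout */

        /* A wake that raced with the value check may already have unqueued q. */
        if (!unqueue_me(&q))
                ret = 0;

        return ret;
}

/*----- sketch end -----*/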

[About the 2.4 futex in RHEL3U2, which takes spinlocks instead]:

However, these spinlocks fail to protect such temporarily queued waiters
from wakeups, because the spinlocks are released *before* unqueue_me(&q)
(lines 343 & 373). This failure allows wake_Y to touch the queue while
wait_A is still in it.

This order is necessary, because it's not safe to call get_user()
while holding any spinlocks. It is not a bug in RHEL.
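
To restate where the window is (condensed from the code above; not the
literal RHEL3 source):

/*----- sketch begin -----*/

        lock_futex_mm();                /* vcache_lock + futex_lock           */
        queue_me(&q, ...);              /* wait_A becomes visible to wakers   */
        /* curval is read via kmap_atomic() with the spinlocks still held     */
        if (curval != val) {
                unlock_futex_mm();      /* <-- locks dropped here             */
                ret = -EWOULDBLOCK;
                goto out;               /* window: wake_Y can run now, find   */
                                        /* wait_A still queued, dequeue it    */
                                        /* and count it as a delivered wakeup */
        }
        ...
out:
        if (!unqueue_me(&q))            /* so this sees "already unqueued"    */
                ret = 0;                /* and reports 0 despite curval != val */

/*----- sketch end -----*/

In that window, a wakeup that was meant for a real waiter is consumed by
wait_A, which was never going to sleep in the first place.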

I think 2.4 is fixable. My original patch for 2.4 was:

/*----- patch begin -----*/

diff -Naur linux-2.4.21-EL3_org/kernel/futex.c linux-2.4.21-EL3/kernel/futex.c
--- linux-2.4.21-EL3_org/kernel/futex.c 2004-08-25 19:47:35.418632860 +0900
+++ linux-2.4.21-EL3/kernel/futex.c 2004-08-25 19:48:32.505546224 +0900
@@ -297,14 +297,20 @@

spin_lock(&vcache_lock);
spin_lock(&futex_lock);
+ ret = __unqueue_me(q);
+ spin_unlock(&futex_lock);
+ spin_unlock(&vcache_lock);
+ return ret;
+}
+
+static inline int __unqueue_me(struct futex_q *q)
+{
if (!list_empty(&q->list)) {
list_del(&q->list);
__detach_vcache(&q->vcache);
- ret = 1;
+ return 1;
}
- spin_unlock(&futex_lock);
- spin_unlock(&vcache_lock);
- return ret;
+ return 0;
}

static inline int futex_wait(unsigned long uaddr,
@@ -333,13 +339,18 @@
* Page is pinned, but may no longer be in this address space.
* It cannot schedule, so we access it with the spinlock held.
*/
- if (!access_ok(VERIFY_READ, uaddr, 4))
- goto out_fault;
+ if (!access_ok(VERIFY_READ, uaddr, 4)) {
+ __unqueue_me(&q);
+ unlock_futex_mm();
+ ret = -EFAULT;
+ goto out;
+ }
kaddr = kmap_atomic(page, KM_USER0);
curval = *(int*)(kaddr + offset);
kunmap_atomic(kaddr, KM_USER0);

if (curval != val) {
+ __unqueue_me(&q);
unlock_futex_mm();
ret = -EWOULDBLOCK;
goto out;
@@ -364,22 +375,18 @@
*/
if (time == 0) {
ret = -ETIMEDOUT;
- goto out;
+ goto out_wait;
}
if (signal_pending(current))
ret = -EINTR;
-out:
+out_wait:
/* Were we woken up anyway? */
if (!unqueue_me(&q))
ret = 0;
+out:
put_page(q.page);

return ret;
-
-out_fault:
- unlock_futex_mm();
- ret = -EFAULT;
- goto out;
}

long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout,

/*----- patch end -----*/

This patch just reorders the old code in the error (fault / value-mismatch) route:

if (fault) {
        unlock(futex);
        ret = -ERRVAR;
        unqueue();
        put_page();
        return ret;
}

to new one:

if (fault) {
        unqueue_in_lock();
        unlock(futex);
        ret = -ERRVAR;
        put_page();
        return ret;
}

It protects the temporarily queued thread from wakeups, doesn't it?
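
The point is that the waker takes the same futex_lock before it walks the
hash queue, so once __unqueue_me(&q) has run under that lock, a concurrent
wake either happened before the waiter was queued or will never find it.
Roughly, on the wake side (again a sketch, not the literal source):

/*----- sketch begin -----*/

        /* wake side, with the patch applied: */
        lock_futex_mm();        /* the same vcache_lock + futex_lock          */
        /* walk the hash chain for uaddr, wake and dequeue up to nr_wake      */
        /* waiters; an entry removed by __unqueue_me() under the lock can     */
        /* never be found here                                                */
        unlock_futex_mm();

/*----- sketch end -----*/

With that, a wakeup can no longer be consumed by a waiter that is about to
return -EWOULDBLOCK or -EFAULT.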

If this works, it could be said that we could also fix the 2.6 futex with a
spinlock... but it would be slow, slow...


Thanks,
H.Seto
