[PATCH/RFC] Futex mmap_sem deadlock

From: Olof Johansson
Date: Tue Feb 22 2005 - 14:17:31 EST


Hi,

Consider a small testcase that spawns off two threads, either thread
doing a loop of:

buf = mmap /dev/zero MAP_SHARED for 0x100000 bytes
call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in said mmap
munmap(buf)
repeat

This will quickly lock up, since the futex_wait code dows a
down_read(mmap_sem), then a get_user().

The do_page_fault code on ppc64 (as well as other architectures) needs
to take the same semaphore for reading. This is all good until the
second thread comes into play: Its mmap call tries to take the same
semaphore for writing which causes in the do_page_fault down_read()
to get stuck. Classic deadlock.

Stacks are as follows:

a.out S 000000000ff9772c 12288 6491 6484 6492 (NOTLB)
Call Trace:
[c0000003eb72fa10] [0000000030ecbd28] 0x30ecbd28 (unreliable)
[c0000003eb72fbe0] [c0000000000119ec] .__switch_to+0xa4/0xf0
[c0000003eb72fc70] [c00000000043ffbc] .schedule+0x4d4/0xe5c
[c0000003eb72fd90] [c0000000000247c8] .sys32_rt_sigsuspend+0xe0/0x120
[c0000003eb72fe30] [c00000000000d7e8] .ppc32_rt_sigsuspend+0x8/0x14
a.out D 000000000ff0ad38 13632 6492 6491 6493 (NOTLB)
Call Trace:
[c00000003a1a7830] [c00000003a1a7850] 0xc00000003a1a7850 (unreliable)
[c00000003a1a7a00] [c0000000000119ec] .__switch_to+0xa4/0xf0
[c00000003a1a7a90] [c00000000043ffbc] .schedule+0x4d4/0xe5c
[c00000003a1a7bb0] [c000000000441ec0] .rwsem_down_write_failed+0x138/0x2f0
[c00000003a1a7c90] [c000000000098db0] .sys_mprotect+0x7bc/0x81c
[c00000003a1a7e30] [c00000000000d600] syscall_exit+0x0/0x18
a.out D 0000000010000654 12832 6493 6492 (NOTLB)
Call Trace:
[c00000000f2674f0] [c0000000000119ec] .__switch_to+0xa4/0xf0
[c00000000f267580] [c00000000043ffbc] .schedule+0x4d4/0xe5c
[c00000000f2676a0] [c0000000004421b4] .rwsem_down_read_failed+0x13c/0x2f4
[c00000000f267780] [c00000000003ba6c] .do_page_fault+0x66c/0x738
[c00000000f267900] [c00000000000b2dc] .handle_page_fault+0x20/0x54
--- Exception: 301 at .do_futex+0x5d4/0x720
LR = .do_futex+0x1e0/0x720
[c00000000f267d60] [c000000000078124] .compat_sys_futex+0xa4/0x16c
[c00000000f267e30] [c00000000000d600] syscall_exit+0x0/0x18


One attempt to fix this is included below. It works, but I'm not entirely
happy with the fact that it's a bit messy solution. If anyone has a
better idea for how to solve it I'd be all ears.

Auditing other read-takers of mmap_sem, I found one more exposure that
could be solved by just moving code around.

-----

Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retake it.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson <olof@xxxxxxxxxxxxxx>


Index: linux-2.5/kernel/futex.c
===================================================================
--- linux-2.5.orig/kernel/futex.c 2005-02-21 16:09:38.000000000 -0600
+++ linux-2.5/kernel/futex.c 2005-02-22 12:35:01.000000000 -0600
@@ -326,11 +326,22 @@
struct futex_hash_bucket *bh1, *bh2;
struct list_head *head1;
struct futex_q *this, *next;
- int ret, drop_count = 0;
+ int ret, curval, drop_count = 0;
unsigned int nqueued;

down_read(&current->mm->mmap_sem);

+ /* Make sure we can access the page, since there's a risk to
+ * deadlock with do_page_fault() if get_user() faults.
+ */
+ while (!check_user_page_readable(current->mm, uaddr1)) {
+ up_read(&current->mm->mmap_sem);
+ /* Fault in the page through get_user() but discard result */
+ if (get_user(curval, (int __user *)uaddr1) != 0)
+ return -EFAULT;
+ down_read(&current->mm->mmap_sem);
+ }
+
ret = get_futex_key(uaddr1, &key1);
if (unlikely(ret != 0))
goto out;
@@ -343,8 +354,6 @@

nqueued = bh1->nqueued;
if (likely(valp != NULL)) {
- int curval;
-
/* In order to avoid doing get_user while
holding bh1->lock and bh2->lock, nqueued
(monotonically increasing field) must be first
@@ -482,6 +491,19 @@

down_read(&current->mm->mmap_sem);

+ /* Make sure we can access the page, since there's a risk to
+ * deadlock with do_page_fault() if get_user() faults.
+ */
+ while (!check_user_page_readable(current->mm, uaddr)) {
+ up_read(&current->mm->mmap_sem);
+
+ /* Fault in the page through get_user() but discard result */
+ if (get_user(curval, (int __user *)uaddr) != 0)
+ return -EFAULT;
+
+ down_read(&current->mm->mmap_sem);
+ }
+
ret = get_futex_key(uaddr, &q.key);
if (unlikely(ret != 0))
goto out_release_sem;
Index: linux-2.5/mm/mempolicy.c
===================================================================
--- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.000000000 -0600
+++ linux-2.5/mm/mempolicy.c 2005-02-21 16:43:08.000000000 -0600
@@ -486,6 +486,7 @@
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
struct mempolicy *pol = current->mempolicy;
+ DECLARE_BITMAP(nodes, MAX_NUMNODES);

if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
return -EINVAL;
@@ -524,16 +525,21 @@
} else
pval = pol->policy;

+ if (nmask)
+ get_zonemask(pol, nodes);
+
+ if (vma) {
+ up_read(&current->mm->mmap_sem);
+ vma = NULL;
+ }
+
err = -EFAULT;
if (policy && put_user(pval, policy))
goto out;

err = 0;
- if (nmask) {
- DECLARE_BITMAP(nodes, MAX_NUMNODES);
- get_zonemask(pol, nodes);
+ if (nmask)
err = copy_nodes_to_user(nmask, maxnode, nodes, sizeof(nodes));
- }

out:
if (vma)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/