[PATCH] futex: Avoid reusing outdated pi_state.

From: Sebastian Andrzej Siewior
Date: Tue Jan 16 2024 - 08:08:27 EST


Jiri Slaby reported a futex state inconsistency resulting in -EINVAL
during a lock operation for a PI futex. A requirement is that the lock
process is interrupted by a timeout or signal:

T1 T2
*owns* futex
futex_lock_pi()
*create PI state, attach to it, queue RT waiter*
rt_mutex_wait_proxy_lock() /* -ETIMEDOUT */
rt_mutex_cleanup_proxy_lock()
remove_waiter()

futex_unlock_pi()
spin_lock(&hb->lock);
top_waiter = futex_top_waiter(hb, &key);
/* top_waiter has no rt_waiter, do_uncontended */
spin_unlock(&hb->lock);

To spice things up, players T3 and T4 enter the game:

T3 T4
*acquires futex in userland*
futex_lock_pi()
futex_q_lock(&q);
futex_lock_pi_atomic()
top_waiter = futex_top_waiter(hb, key);
/* top_waiter is from T1, still */
attach_to_pi_state()
/* Here -EINVAL is returned because uval
* points to T3 but pi_state says T1.
*/

We must not unlock the futex for userland as long as there is still
state pending in the kernel. That state can be picked up by a later
futex_lock_pi() caller (just as futex_unlock_pi() observed it). The
caller then sees an outdated state of the futex because it was not
removed during the unlock operation in the kernel.
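
The -EINVAL in the second diagram comes from a consistency check between
the user space value and the in-kernel state. A minimal user-space sketch
of that check follows. It is not the kernel's attach_to_pi_state(); the
toy_* names and the TOY_TID_MASK constant are assumptions made up for
illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Same value as the kernel's FUTEX_TID_MASK, redefined to stay self-contained. */
#define TOY_TID_MASK 0x3fffffffu

/* Hypothetical stand-in for the in-kernel pi_state. */
struct toy_pi_state {
	uint32_t owner_tid;	/* TID the in-kernel state believes owns the futex */
};

/*
 * Toy version of the owner check: the TID stored in the user space
 * futex word must match the owner recorded in the pi_state.
 */
static int toy_attach_to_pi_state(uint32_t uval, const struct toy_pi_state *ps)
{
	uint32_t tid = uval & TOY_TID_MASK;

	if (tid != ps->owner_tid)
		return -EINVAL;	/* user space and kernel state disagree */
	return 0;
}
```

If a stale pi_state still names the previous owner while the futex word
already holds the TID of a new owner, the toy check fails in the same way
as described above.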

The lock cannot be handed over to T1 because it has already given up and
started to clean up.
All futex_q entries point to the same pi_state and the pi_mutex has no
waiters. A waiter cannot be enqueued because hb->lock and
pi_mutex.wait_lock are held (by the unlock operation) and
futex_lock_pi() uses the same lock ordering during locking.

Remove all futex_q entries which point to the futex from the hb list if
no waiter has been observed. This closes the race window by removing all
pointers to the previous in-kernel state.
The leaving futex_lock_pi() caller can clean up the pi_state once it
acquires hb->lock. The following futex_lock_pi() caller will create a
new in-kernel state.
The removal from hb->chain is optional only if the futex was not
acquired, because in that case the unlock path might already have done
it with hb->lock held.
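
The interplay of the two changes can be sketched in user space. This is a
toy model, not kernel code: the toy_* types and helpers are hypothetical,
the hash bucket is a small array instead of a plist, and hb->lock is
assumed to be held around every call.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_BUCKET_SLOTS 4

/* Hypothetical stand-in for struct futex_q. */
struct toy_q {
	int key;		/* identifies the futex */
	bool queued;		/* still linked into the bucket? */
	bool has_rt_waiter;	/* still enqueued on the toy pi_mutex? */
};

/* Hypothetical stand-in for a hash bucket; hb->lock is assumed held. */
struct toy_bucket {
	struct toy_q *slot[TOY_BUCKET_SLOTS];
};

enum toy_unlock_result { TOY_UNCONTENDED, TOY_HANDOVER };

static bool toy_queue(struct toy_bucket *hb, struct toy_q *q)
{
	for (size_t i = 0; i < TOY_BUCKET_SLOTS; i++) {
		if (!hb->slot[i]) {
			hb->slot[i] = q;
			q->queued = true;
			return true;
		}
	}
	return false;	/* bucket full */
}

static void toy_unqueue(struct toy_bucket *hb, struct toy_q *q)
{
	for (size_t i = 0; i < TOY_BUCKET_SLOTS; i++) {
		if (hb->slot[i] == q)
			hb->slot[i] = NULL;
	}
	q->queued = false;
}

static struct toy_q *toy_top_waiter(struct toy_bucket *hb, int key)
{
	for (size_t i = 0; i < TOY_BUCKET_SLOTS; i++) {
		if (hb->slot[i] && hb->slot[i]->key == key)
			return hb->slot[i];
	}
	return NULL;
}

/* Unlock side: drop entries whose waiter is leaving, then look again. */
static enum toy_unlock_result toy_unlock(struct toy_bucket *hb, int key)
{
	struct toy_q *top;

	while ((top = toy_top_waiter(hb, key)) != NULL) {
		if (top->has_rt_waiter)
			return TOY_HANDOVER;
		toy_unqueue(hb, top);	/* stale entry, mirrors "goto retry_hb" */
	}
	return TOY_UNCONTENDED;
}

/* Lock side cleanup: the unqueue is optional if the lock was not acquired. */
static void toy_unqueue_pi(struct toy_bucket *hb, struct toy_q *q, bool have_lock)
{
	if (have_lock || q->queued)
		toy_unqueue(hb, q);
}
```

In this model, a later lookup for the same key finds nothing once the
unlocker has dropped the entry of a leaving waiter, and the leaving
waiter's own toy_unqueue_pi(..., false) becomes a no-op.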

Fixes: fbeb558b0dd0d ("futex/pi: Fix recursive rt_mutex waiter state")
Reported-by: Jiri Slaby <jirislaby@xxxxxxxxxx>
Closes: c737a604-d441-49c6-a5cd-ef01e9f2a454@xxxxxxxxxx
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
---
 kernel/futex/core.c    |  9 +++++++--
 kernel/futex/futex.h   |  2 +-
 kernel/futex/pi.c      | 11 +++++++----
 kernel/futex/requeue.c |  2 +-
 4 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index dad981a865b84..31505b0a405ae 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -628,10 +628,15 @@ int futex_unqueue(struct futex_q *q)
 /*
  * PI futexes can not be requeued and must remove themselves from the
  * hash bucket. The hash bucket lock (i.e. lock_ptr) is held.
+ * If the PI futex was not acquired (due to timeout or signal) then it removes
+ * its rt_waiter before it removes itself from the futex queue. The unlocker
+ * will remove the futex_q from the queue if it observes an empty waitqueue.
+ * Therefore the unqueue is optional in this case.
  */
-void futex_unqueue_pi(struct futex_q *q)
+void futex_unqueue_pi(struct futex_q *q, bool have_lock)
 {
-	__futex_unqueue(q);
+	if (have_lock || !plist_node_empty(&q->list))
+		__futex_unqueue(q);
 
 	BUG_ON(!q->pi_state);
 	put_pi_state(q->pi_state);
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 8b195d06f4e8e..c7133ffb381fd 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -252,7 +252,7 @@ static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
 	spin_unlock(&hb->lock);
 }
 
-extern void futex_unqueue_pi(struct futex_q *q);
+extern void futex_unqueue_pi(struct futex_q *q, bool have_lock);
 
 extern void wait_for_owner_exiting(int ret, struct task_struct *exiting);

diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 90e5197f4e569..4023841358eea 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1070,6 +1070,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 	 * haven't already.
 	 */
 	res = fixup_pi_owner(uaddr, &q, !ret);
+	futex_unqueue_pi(&q, !ret);
 	/*
 	 * If fixup_pi_owner() returned an error, propagate that. If it acquired
 	 * the lock, clear our -ETIMEDOUT or -EINTR.
@@ -1077,7 +1078,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 	if (res)
 		ret = (res < 0) ? res : 0;
 
-	futex_unqueue_pi(&q);
 	spin_unlock(q.lock_ptr);
 	goto out;

@@ -1135,6 +1135,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 
 	hb = futex_hash(&key);
 	spin_lock(&hb->lock);
+retry_hb:
 
 	/*
 	 * Check waiters first. We do not trust user space values at
@@ -1177,12 +1178,15 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 		/*
 		 * Futex vs rt_mutex waiter state -- if there are no rt_mutex
 		 * waiters even though futex thinks there are, then the waiter
-		 * is leaving and the uncontended path is safe to take.
+		 * is leaving. We need to remove it from the list so that the
+		 * current PI-state is not observed by future futex_lock_pi()
+		 * callers before the leaving waiter had a chance to clean up.
 		 */
 		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
 		if (!rt_waiter) {
+			__futex_unqueue(top_waiter);
 			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-			goto do_uncontended;
+			goto retry_hb;
 		}
 
 		get_pi_state(pi_state);
@@ -1217,7 +1221,6 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 		return ret;
 	}
 
-do_uncontended:
 	/*
 	 * We have no kernel internal state, i.e. no waiters in the
 	 * kernel. Waiters which are about to queue themselves are stuck
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index eb21f065816ba..57869ef20bda3 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -873,7 +873,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		if (res)
 			ret = (res < 0) ? res : 0;
 
-		futex_unqueue_pi(&q);
+		futex_unqueue_pi(&q, true);
 		spin_unlock(q.lock_ptr);
 
 		if (ret == -EINTR) {
--
2.43.0