[PATCH v2 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()

From: Sebastian Andrzej Siewior
Date: Fri Aug 25 2023 - 14:11:25 EST


Hi!

These are basically the 'same' patches as sent earlier by PeterZ here:
https://lore.kernel.org/all/20230815110121.117752409@xxxxxxxxxxxxx/

Here is a diff against the original submission, incorporating what has
been discussed on the list; only patch 5 has changed:

|5: 2b0bb1b525cf1 ! 5: 91f7c083c398d locking/rtmutex: Use rt_mutex specific scheduler helpers
| @@ Commit message
| Link: https://lore.kernel.org/r/20230815111430.421408298@xxxxxxxxxxxxx
| Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
|
| + ## kernel/futex/pi.c ##
| +@@
| + // SPDX-License-Identifier: GPL-2.0-or-later
| +
| + #include <linux/slab.h>
| ++#include <linux/sched/rt.h>
| + #include <linux/sched/task.h>
| +
| + #include "futex.h"
| +@@ kernel/futex/pi.c: int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
| + goto no_block;
| + }
| +
| ++ /*
| ++ * Must be done before we enqueue the waiter, here is unfortunately
| ++ * under the hb lock, but that *should* work because it does nothing.
| ++ */
| ++ rt_mutex_pre_schedule();
| ++
| + rt_mutex_init_waiter(&rt_waiter);
| +
| + /*
| +@@ kernel/futex/pi.c: int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
| + if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
| + ret = 0;
| +
| ++ /*
| ++ * Waiter is unqueued.
| ++ */
| ++ rt_mutex_post_schedule();
| + no_block:
| + /*
| + * Fixup the pi_state owner and possibly acquire the lock if we
| +
| ## kernel/locking/rtmutex.c ##
| @@ kernel/locking/rtmutex.c: static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock,
| raw_spin_unlock_irq(&lock->wait_lock);
| @@ kernel/locking/rwbase_rt.c: static int __sched __rwbase_read_lock(struct rwbase_
| return ret;
| }
|
| +@@ kernel/locking/rwbase_rt.c: static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
| + /* Force readers into slow path */
| + atomic_sub(READER_BIAS, &rwb->readers);
| +
| ++ rt_mutex_pre_schedule();
| ++
| + raw_spin_lock_irqsave(&rtm->wait_lock, flags);
| + if (__rwbase_write_trylock(rwb))
| + goto out_unlock;
| +@@ kernel/locking/rwbase_rt.c: static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
| + if (rwbase_signal_pending_state(state, current)) {
| + rwbase_restore_current_state();
| + __rwbase_write_unlock(rwb, 0, flags);
| ++ rt_mutex_post_schedule();
| + trace_contention_end(rwb, -EINTR);
| + return -EINTR;
| + }
| +@@ kernel/locking/rwbase_rt.c: static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
| +
| + out_unlock:
| + raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
| ++ rt_mutex_post_schedule();
| + return 0;
| + }
| +
|
| ## kernel/locking/rwsem.c ##
| @@ kernel/locking/rwsem.c: static inline void __downgrade_write(struct rw_semaphore *sem)
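
For reference, the scheduler-side helpers which those hunks call look
roughly like this (a simplified sketch based on the series, not the
exact code; fetch_and_set() is a small helper that swaps the flag and
returns its old value):

| void rt_mutex_pre_schedule(void)
| {
| 	/* Flag the upcoming schedule as being on behalf of an rt_mutex. */
| 	lockdep_assert(!fetch_and_set(current->sched_rt_mutex, 1));
| 	/* Flush pending work (block plug, worker state) up front ... */
| 	sched_submit_work(current);
| }
|
| void rt_mutex_post_schedule(void)
| {
| 	/* ... and restore the worker state once the waiter is unqueued. */
| 	sched_update_worker(current);
| 	lockdep_assert(fetch_and_set(current->sched_rt_mutex, 0));
| }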

I just stuffed this into the latest -RT. As for the futex_lock_pi() change,
I wouldn't do the "whole" rt_mutex_pre_schedule() thingy for futex, since
there it has to be a nop because no I/O is done. So something like
| static void rt_mutex_pre_schedule_futex(void)
| {
| 	/* Nobody may have flagged an rt_mutex schedule already. */
| 	lockdep_assert(!fetch_and_set(current->sched_rt_mutex, 1));
| 	/* Assert that sched_submit_work() would have nothing to do: */
| 	WARN_ON_ONCE(current->flags & (PF_WQ_WORKER | PF_IO_WORKER));
| 	SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
| 	WARN_ON_ONCE(current->plug);
| }
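
i.e. the futex variant would only assert that sched_submit_work() has
nothing to do for this task (no block plug, not a workqueue/io worker)
rather than actually invoking it.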

but then, why overcomplicate things? Speaking of overcomplicating, this
triggered:
| ------------[ cut here ]------------
| WARNING: CPU: 6 PID: 2155 at kernel/locking/spinlock_rt.c:40 rt_spin_lock+0x5a/0xf0
| Modules linked in:
| CPU: 6 PID: 2155 Comm: futex_requeue_p Not tainted 6.5.0-rc7-rt3+ #272
| Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
| RIP: 0010:rt_spin_lock+0x5a/0xf0
| Call Trace:
| <TASK>
| ? rt_spin_lock+0x5a/0xf0
| futex_wait_requeue_pi+0x3a4/0x5c0
| do_futex+0x152/0x190
| __x64_sys_futex+0x110/0x1b0
| do_syscall_64+0x44/0x90
| entry_SYSCALL_64_after_hwframe+0x6f/0xd9
| RIP: 0033:0x7f8a79ed34f9
| ---[ end trace 0000000000000000 ]---

and the assertion that fired is
| static __always_inline void rtlock_lock(struct rt_mutex_base *rtm)
| {
| lockdep_assert(!current->pi_blocked_on);

Here, rt_mutex_wait_proxy_lock() returned with -EINTR and did not acquire
the lock. Later, rt_mutex_cleanup_proxy_lock() would clear
->pi_blocked_on, but that only happens after
| /* Current is not longer pi_blocked_on */
| spin_lock(q.lock_ptr);

and that comment is not accurate for ret != 0. This isn't a regression
from before; I just didn't spot it earlier.
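
To illustrate, the relevant ordering in futex_wait_requeue_pi() is
roughly the following (simplified sketch, not the exact source):

| ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
|
| /* Current is not longer pi_blocked_on */
| spin_lock(q.lock_ptr);	/* <- rtlock_lock() asserts !pi_blocked_on */
| if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
| 	ret = 0;		/* ->pi_blocked_on cleared only here */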

Sebastian