Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

From: Peter Zijlstra
Date: Thu May 11 2023 - 09:44:07 EST


On Wed, May 10, 2023 at 05:04:15PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-05-03 15:20:51 [+0200], Peter Zijlstra wrote:
> > Urgh, so I really don't like this.
> >
> > The end result is something like:
> >
> > rt_mutex_lock()
> > sched_submit_work();
> > // a nested rt_mutex_lock() here will not clobber
> > // ->pi_blocked_on because it's not set yet.
> >
> > task_blocks_on_rt_mutex();
> > tsk->pi_blocked_on = waiter;
> > rt_mutex_enqueue(lock, waiter); <-- the real problem
> >
> > rt_mutex_slowlock_block();
> > schedule_rtmutex();
> >
> > sched_resume_work();
> >
> > And all of this is not just because of tsk->pi_blocked_on, but mostly
> > because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> > enqueue thing is what makes the 'simple' solution of saving/restoring
> > tsk->pi_blocked_on not work.
> >
> > Basically the pi_blocked_on corruption is a side effect, not the
> > fundamental issue. One task having two waiters registered is the bigger
> > issue.
>
> Yes, one task blocks on two locks, but the lock taken in
> sched_submit_work() needs to be solved first.

Sure.. just saying that the focus on ->pi_blocked_on is a wee bit
under-representing the issue, because there's a 'simple' (save/restore)
solution for ->pi_blocked_on. There is none for the whole waiter
situation.
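
To make that concrete, a minimal sketch of the save/restore approach and
where it falls short (illustrative only, not the actual patch):

/* Illustrative only -- not the proposed patch. */
static void sched_submit_work_sketch(struct task_struct *tsk)
{
	/* The 'simple' fix: stash ->pi_blocked_on across the callbacks. */
	struct rt_mutex_waiter *saved = tsk->pi_blocked_on;

	tsk->pi_blocked_on = NULL;
	blk_flush_plug(tsk->plug, true);	/* may block on another rt_mutex */
	tsk->pi_blocked_on = saved;

	/*
	 * This keeps ->pi_blocked_on sane, but the waiter that
	 * task_blocks_on_rt_mutex() already enqueued on the first lock's
	 * wait tree is still there: if the flush blocks, this task has
	 * two waiters enqueued at once, and no save/restore undoes
	 * rt_mutex_enqueue().
	 */
}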

> > Now, sched_submit_work() could also use a (regular) mutex -- after all
> > it's a fully preemptible context. And then we're subject to the 'same'
> > problem but with tsk->blocked_on (DEBUG_MUTEXES=y).
>
> Not sure I follow. We only invoke sched_submit_work() if we block on a
> lock which is also sleeping on !RT (mutex_t, not spinlock_t). I browsed
> a few of the sched_submit_work() callbacks and they all use non-sleeping
> locks (on !RT).

Right, but this is not enforced; they could change it and we'd never
know.

It used to be run with IRQs disabled so anybody trying to do 'funny'
things would get yelled at, but commit 9c40cef2b799 ("sched: Move
blk_schedule_flush_plug() out of __schedule()") broke that.

> If sched_submit_work() were to use a mutex_t lock then we would
> recursively call blk_flush_plug() before setting tsk->blocked_on and

I'm not following; the mutex code sets tsk->blocked_on before it calls
schedule(), getting into the very same problem you have with rt_mutex
(see the sketch below).

> perform the same callback and block on the very same lock (again).
> This is no different from !RT; therefore you must not use a
> sleeping lock (mutex_t) in the callback.

See the enforcement thing; today nothing stops the code from using a
mutex or other blocking primitives here.
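
For reference, the DEBUG_MUTEXES side looks roughly like this (paraphrased
from kernel/locking/mutex-debug.c; simplified, not verbatim):

void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
			    struct task_struct *task)
{
	lockdep_assert_held(&lock->wait_lock);

	/* Mark the current thread as blocked on the lock: */
	task->blocked_on = waiter;
}

A nested mutex_lock() reached from a sched_submit_work() callback would
set (and later clear) ->blocked_on behind the outer mutex's back, the
exact analogue of the ->pi_blocked_on clobbering on rt_mutex.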


> > Pulling out task_is_running() like this is going to get into conflict
> > with TJ's patches here:
> >
> > https://lkml.kernel.org/r/20230418205159.724789-1-tj@xxxxxxxxxx
> >
> > That makes sched_submit_work() do things even if task_is_running().
>
> Do I rebase my stuff on top of his then and we're good?

I just suggested he try something else:

https://lkml.kernel.org/r/20230510150946.GO4253@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

If that works out, this worry goes away.

If we get PROVE_RAW_LOCK_NESTING usable, something like the below might
help out with the validation part...

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 944c3ae39861..8df1aa333dee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6694,11 +6694,18 @@ void __noreturn do_task_dead(void)
 
 static inline void sched_submit_work(struct task_struct *tsk)
 {
+	static DEFINE_WAIT_OVERRIDE_MAP(sched_map, LD_WAIT_CONFIG);
 	unsigned int task_flags;
 
 	if (task_is_running(tsk))
 		return;
 
+	/*
+	 * Establish LD_WAIT_CONFIG context to ensure none of the code called
+	 * will use a blocking primitive.
+	 */
+	lock_map_acquire_try(&sched_map);
+
 	task_flags = tsk->flags;
 	/*
 	 * If a worker goes to sleep, notify and ask workqueue whether it
@@ -6723,6 +6730,8 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 * make sure to submit it to avoid deadlocks.
 	 */
 	blk_flush_plug(tsk->plug, true);
+
+	lock_map_release(&sched_map);
 }
 
 static void sched_update_worker(struct task_struct *tsk)
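
With the above in place, any sleeping lock taken from one of the
submit-work callbacks becomes an invalid wait-type nesting. A hypothetical
offender, purely to illustrate what lockdep would then report (names made
up):

static DEFINE_MUTEX(flush_mutex);		/* made-up */

static void example_flush_callback(void)	/* made-up */
{
	/*
	 * mutex_lock() is LD_WAIT_SLEEP; acquiring it inside the
	 * LD_WAIT_CONFIG context established via sched_map is flagged
	 * by lockdep once PROVE_RAW_LOCK_NESTING is enabled.
	 */
	mutex_lock(&flush_mutex);
	mutex_unlock(&flush_mutex);
}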