Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

From: Sebastian Andrzej Siewior
Date: Wed May 10 2023 - 11:04:23 EST


On 2023-05-03 15:20:51 [+0200], Peter Zijlstra wrote:
> Urgh, so I really don't like this.
>
> The end result is something like:
>
> rt_mutex_lock()
>   sched_submit_work();
>   // a nested rt_mutex_lock() here will not clobber
>   // ->pi_blocked_on because it's not set yet.
>
>   task_blocks_on_rt_mutex();
>     tsk->pi_blocked_on = waiter;
>     rt_mutex_enqueue(lock, waiter); <-- the real problem
>
>   rt_mutex_slowlock_block();
>     schedule_rtmutex();
>
> sched_resume_work();
>
> And all of this is not just because of tsk->pi_blocked_on, but mostly
> because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> enqueue thing is what makes the 'simple' solution of saving/restoring
> tsk->pi_blocked_on not work.
>
> Basically the pi_blocked_on corruption is a side effect, not the
> fundamental issue. One task having two waiters registered is the bigger
> issue.

Yes, one task blocks on two locks, but the lock taken in
sched_submit_work() needs to be resolved first.
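
To make the ordering concrete, here is a condensed sketch of the flow
quoted above (names taken from the quote; the real functions take more
arguments and also handle the fast path, timeouts and ww-mutex cases):

	/* sketch only, not the actual patch */
	static void rt_mutex_slowlock_sketch(struct rt_mutex_base *lock)
	{
		struct rt_mutex_waiter waiter;

		sched_submit_work();		/* may flush plugged I/O;
						 * ->pi_blocked_on is still NULL,
						 * so a nested lock taken by a
						 * callback starts from a clean
						 * state */
		task_blocks_on_rt_mutex(lock, &waiter);	/* sets ->pi_blocked_on
							 * and enqueues the waiter */
		rt_mutex_slowlock_block(lock, &waiter);	/* blocks via
							 * schedule_rtmutex() */
		sched_resume_work();
	}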

> Now, sched_submit_work() could also use (regular) mutex -- after all
> it's a fully preemptible context. And then we're subject to the 'same'
> problem but with tsk->blocked_on (DEBUG_MUTEXES=y).

Not sure I follow. We only invoke sched_submit_work() if we block on a
lock which is sleeping on !RT (mutex_t, not spinlock_t). I browsed a
few of the sched_submit_work() callbacks and they all use non-sleeping
locks (on !RT).
If a sched_submit_work() callback used a mutex_t lock then we would
recursively call blk_flush_plug() before setting tsk->blocked_on,
invoke the same callback and block on the very same lock (again).
This is no different from !RT, therefore you must not use a sleeping
lock (mutex_t) in the callback.
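
As an illustration of that recursion (hypothetical callback, nothing
like this exists in tree):

	static DEFINE_MUTEX(example_lock);	/* hypothetical sleeping lock */

	static void broken_submit_work_callback(void)
	{
		/*
		 * If mutex_lock() blocks, the task ends up in schedule(),
		 * which calls sched_submit_work() again, which runs this
		 * very callback once more and blocks on example_lock a
		 * second time. The result is the same on RT and !RT, which
		 * is why the existing callbacks only use non-sleeping locks.
		 */
		mutex_lock(&example_lock);
		mutex_unlock(&example_lock);
	}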

> This means that strictly speaking we should litter mutex with the same
> thing :/

No need, see the logic above.

> This all feels fragile to me. Too many things spread out in too many
> places. An alternative is something like:
>
> void __sched schedule_pi(void)
> {
> 	struct task_struct *tsk = current;
> 	void *waiter = tsk->pi_blocked_on;
>
> 	sched_submit_work(tsk);
> 	do {
> 		preempt_disable();
> 		if (rt_mutex_blocks(tsk, waiter))
> 			schedule();
> 		sched_preempt_enable_no_resched();
> 	} while (need_resched());
> 	sched_update_worker(tsk);
> }
>
> And then rt_mutex_blocks() will do the enqueue/boost/optimistic_spin
> thing. However, this is going to be a massive reorg of the rt_mutex code
> and I'm not entirely sure the end result will actually be better... it
> might just make a mess elsewhere :/

It might not be needed…

> > @@ -6723,8 +6720,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
> >  		blk_flush_plug(tsk->plug, true);
> >  }
>
> > +asmlinkage __visible void __sched schedule(void)
> > +{
> > +	if (!task_is_running(current))
> > +		sched_submit_work();
> > +	schedule_loop(SM_NONE);
> > +	sched_resume_work();
> >  }
> >  EXPORT_SYMBOL(schedule);
>
> pulling out task_is_running() like this is going to get into conflict
> with TJ's patches here:
>
> https://lkml.kernel.org/r/20230418205159.724789-1-tj@xxxxxxxxxx
>
> That makes sched_submit_work() do things even if task_is_running().

Do I rebase my stuff on top of his then and we're good?

Sebastian