Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

From: Peter Zijlstra
Date: Wed May 03 2023 - 09:21:20 EST


On Thu, Apr 27, 2023 at 01:19:34PM +0200, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>
> schedule() invokes sched_submit_work() before scheduling and
> sched_update_worker() afterwards to ensure that queued block requests are
> flushed and the (IO)worker machineries can instantiate new workers if
> required. This avoids deadlocks and starvation.
>
> With rt_mutexes this can lead to a subtle problem:
>
> When a task blocks on an rtmutex, current::pi_blocked_on points to the
> rtmutex it blocks on. When one of the functions in sched_submit/resume_work()
> contends on an rtmutex-based lock, then that would corrupt
> current::pi_blocked_on.
>
> Make it possible to let rtmutex issue the calls outside of the slowpath,
> i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:
>
> - Exposing sched_submit_work() and moving the task_is_running() condition
> into schedule()
>
> - Renaming sched_update_worker() to sched_resume_work() and exposing it
> too.
>
> - Providing sched_rtmutex() which just does the inner loop of scheduling
> until need_resched() is no longer set. Split out the loop so this does
> not create yet another copy.
>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>

Urgh, so I really don't like this.
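
For reference, mainline currently has schedule() do the submit/resume
work around the whole rescheduling loop:

        asmlinkage __visible void __sched schedule(void)
        {
                struct task_struct *tsk = current;

                sched_submit_work(tsk);
                do {
                        preempt_disable();
                        __schedule(SM_NONE);
                        sched_preempt_enable_no_resched();
                } while (need_resched());
                sched_update_worker(tsk);
        }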

The end result is something like:

        rt_mutex_lock()
          sched_submit_work();
            // a nested rt_mutex_lock() here will not clobber
            // ->pi_blocked_on because it's not set yet.

          task_blocks_on_rt_mutex();
            tsk->pi_blocked_on = waiter;
            rt_mutex_enqueue(lock, waiter); <-- the real problem

          rt_mutex_slowlock_block();
            schedule_rtmutex();

          sched_resume_work();

And all of this is not just because of tsk->pi_blocked_on, but mostly
because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
enqueue thing is what makes the 'simple' solution of saving/restoring
tsk->pi_blocked_on not work.

Basically the pi_blocked_on corruption is a side effect, not the
fundamental issue. One task having two waiters registered is the bigger
issue.
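
To make that concrete, a save/restore hack around the work-submit path
would end up with something like this (hypothetical sequence, same
notation as above):

        rt_mutex_lock(A)
          task_blocks_on_rt_mutex()
            tsk->pi_blocked_on = waiter_A;
            rt_mutex_enqueue(A, waiter_A);      // waiter_A stays queued on A
          schedule()
            saved = tsk->pi_blocked_on;         // the "save"
            tsk->pi_blocked_on = NULL;
            sched_submit_work()
              rt_mutex_lock(B)                  // contends on an rtmutex
                tsk->pi_blocked_on = waiter_B;
                rt_mutex_enqueue(B, waiter_B);  // tsk now queued on A *and* B
            tsk->pi_blocked_on = saved;         // the "restore"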

Now, sched_submit_work() could also use a (regular) mutex -- after all
it's a fully preemptible context. And then we're subject to the 'same'
problem but with tsk->blocked_on (DEBUG_MUTEXES=y).

This means that strictly speaking we should litter mutex with the same
thing :/
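
For reference, that path is roughly:

        mutex_lock()
          __mutex_lock_common()
            debug_mutex_add_waiter();
              task->blocked_on = waiter;        // the mutex equivalent of
                                                // the ->pi_blocked_on slot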

This all feels fragile to me. Too many things spread out in too many
places. An alternative is something like:

void __sched schedule_pi(void)
{
        struct task_struct *tsk = current;
        void *waiter = tsk->pi_blocked_on;

        sched_submit_work(tsk);
        do {
                preempt_disable();
                if (rt_mutex_blocks(tsk, waiter))
                        schedule();
                sched_preempt_enable_no_resched();
        } while (need_resched());
        sched_update_worker(tsk);
}

And then rt_mutex_blocks() will do the enqueue/boost/optimistic_spin
thing. However, this is going to be a massive reorg of the rt_mutex code
and I'm not entirely sure the end result will actually be better... it
might just make a mess elsewhere :/
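
For illustration only -- rt_mutex_blocks() doesn't exist and the helper
names below are entirely made up, standing in for the existing
enqueue/boost/spin code -- it would be something like:

        /* Sketch; helper names are placeholders, not real kernel APIs. */
        static bool rt_mutex_blocks(struct task_struct *tsk, void *arg)
        {
                struct rt_mutex_waiter *waiter = arg;

                if (!waiter)
                        return true;            /* plain schedule() */

                /* try-lock / optimistic spin before committing to block */
                if (rt_mutex_waiter_trylock(waiter))
                        return false;           /* acquired, don't sleep */

                /* register the (single) waiter and propagate the PI boost */
                rt_mutex_waiter_enqueue(waiter);
                return true;                    /* really blocked, sleep */
        }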

> +void sched_submit_work(void)
>  {
> -        unsigned int task_flags;
> +        struct task_struct *tsk = current;
> +        unsigned int task_flags = tsk->flags;
>
> -        if (task_is_running(tsk))
> -                return;
> -
> -        task_flags = tsk->flags;
>          /*
>           * If a worker goes to sleep, notify and ask workqueue whether it
>           * wants to wake up a task to maintain concurrency.
> @@ -6723,8 +6720,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
>          blk_flush_plug(tsk->plug, true);
>  }

> +asmlinkage __visible void __sched schedule(void)
> +{
> +        if (!task_is_running(current))
> +                sched_submit_work();
> +        schedule_loop(SM_NONE);
> +        sched_resume_work();
> +}
>  EXPORT_SYMBOL(schedule);

Pulling out task_is_running() like this is going to get into conflict
with TJ's patches here:

https://lkml.kernel.org/r/20230418205159.724789-1-tj@xxxxxxxxxx

That makes sched_submit_work() do things even if task_is_running().