Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

From: Crystal Wood
Date: Tue May 09 2023 - 18:15:38 EST


On Wed, 2023-05-03 at 15:20 +0200, Peter Zijlstra wrote:
> On Thu, Apr 27, 2023 at 01:19:34PM +0200, Sebastian Andrzej Siewior wrote:
> > From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> >
> > schedule() invokes sched_submit_work() before scheduling and
> > sched_update_worker() afterwards to ensure that queued block requests
> > are
> > flushed and the (IO)worker machineries can instantiate new workers if
> > required. This avoids deadlocks and starvation.
> >
> > With rt_mutexes this can lead to subtle problem:
> >
> >   When rtmutex blocks current::pi_blocked_on points to the rtmutex it
> >   blocks on. When one of the functions in sched_submit/resume_work()
> >   contends on a rtmutex based lock then that would corrupt
> >   current::pi_blocked_on.
> >
> > Make it possible to let rtmutex issue the calls outside of the slowpath,
> > i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:
> >
> >   - Exposing sched_submit_work() and moving the task_running() condition
> >     into schedule()
> >
> >   - Renamimg sched_update_worker() to sched_resume_work() and exposing
> > it
> >     too.
> >
> >   - Providing sched_rtmutex() which just does the inner loop of
> > scheduling
> >     until need_resched() is not longer set. Split out the loop so this
> > does
> >     not create yet another copy.
> >
> > Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
>
> Urgh, so I really don't like this.
>
> The end result is something like:
>
>         rt_mutex_lock()
>           sched_submit_work();
>             // a nested rt_mutex_lock() here will not clobber
>             // ->pi_blocked_on because it's not set yet.
>
>           task_blocks_on_rt_mutex();
>             tsk->pi_blocked_on = waiter;
>             rt_mutex_enqueue(lock, waiter); <-- the real problem
>
>           rt_mutex_slowlock_block();
>             schedule_rtmutex();
>
>           sched_resume_work();
>
> And all of this it not just because tsk->pi_blocked_on, but mostly
> because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> enqueue thing is what makes the 'simple' solution of saving/restoring
> tsk->pi_blocked_on not work.
>
> Basically the pi_blocked_on curruption is a side effect, not the
> fundamental issue. One task having two waiters registered is the bigger
> issue.

Where do you see pi_blocked_on being saved/restored? The whole point of
this patchset is to deal with sched_submit_work() before anything has
been done on the "outer" lock acquisition (not just pi_blocked_on, but
also enqueuing) other than failing the fast path.

> Now, sched_submit_work() could also use (regular) mutex -- after all
> it's a fully preemptible context. And then we're subject to the 'same'
> problem but with tsk->blocked_on (DEBUG_MUTEXES=y).

It's fully preemptible but it still shouldn't be doing things that would
block on non-RT. That'd already be broken for a number of reasons (task
state corruption, infinite recursion if current->plug isn't cleared
before doing whatever causes another standard schedule(), etc).

-Crystal