Re: [PATCH] sched, fair: Allow a per-cpu kthread waking a task to stack on the same CPU

From: Mel Gorman
Date: Wed Jan 29 2020 - 19:56:50 EST


On Thu, Jan 30, 2020 at 09:00:21AM +1100, Dave Chinner wrote:
> On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 28, 2020 at 09:10:12AM +0000, Mel Gorman wrote:
> > > Peter, Ingo and Vincent -- I know the timing is bad due to the merge
> > > window but do you have any thoughts on allowing select_idle_sibling to
> > > stack a wakee task on the same CPU as a waker in this specific case?
> >
> > I sort of see, but *groan*...
> >
> > so if the kworker unlocks a contended mutex/rwsem/completion...
> >
> > I suppose the fact that it limits it to tasks that were running on the
> > same CPU limits the impact if we do get it wrong.
> >
> > Elsewhere you write:
> >
> > > I would prefer the wakeup code did not have to signal that it's a
> > > > synchronous wakeup. Sync wakeups do exist but callers got it wrong many
> > > times where stacking was allowed and then the waker did not go to sleep.
> > > While the chain of events are related, they are not related in a very
> > > obvious way. I think it's much safer to keep this as a scheduler
> > > heuristic instead of depending on callers to have sufficient knowledge
> > > of the scheduler implementation.
> >
> > That is true; the existing WF_SYNC has caused many issues for maybe
> > being too strong.
> >
> > But what if we create a new hint that combines both these ideas? Say
> > WF_COMPLETE and subject that to these same criteria. This way we can
> > eliminate wakeups from locks and such (they won't have this set).
> >
> > Or am I just making things complicated again?
>
> I suspect this is making it complicated again, because it requires
> the people who maintain the code that is using workqueues to
> understand when they might need to use a special wakeup interface in
> the work function. And that includes code that currently calls
> wake_up_all() because there can be hundreds of independent tasks
> waiting on the IO completion (e.g all the wait queues in the XFS
> journal code can (and do) have multiple threads waiting on them).
>
> IOWs, requiring a special flag just to optimise this specific case
> (i.e. single dependent waiter on same CPU as the kworker) when the
> adverse behaviour is both hardware and workload dependent means it
> just won't get used correctly or reliably.
>

I agree. Pick any of Rusty's rules from "-2 Read the implementation
and you'll get it wrong" all the way down to "-10 It's impossible to
get right.".

> Hence I'd much prefer the kernel detects and dynamically handles
> this situation at runtime, because this pattern of workqueue usage
> is already quite common and will only become more widespread as we
> progress towards async processing of syscalls.
>

To be fair, as Peter says, the kernel patch may not detect this
properly; there are corner cases where it will get it wrong. My thinking is
that *at the moment*, when the heuristic is wrong, it's almost certainly
because the machine is so over-saturated that multiple related tasks
are stacking anyway.
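
For anyone following the thread without the patch in front of them, the
check under discussion can be modelled roughly as below. This is a
simplified userspace sketch and not the patch itself: struct wakeup_model,
its fields and may_stack_on_waker_cpu() are hypothetical names made up for
illustration, and the real decision is made inside select_idle_sibling()
against scheduler state that is not modelled here.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified model of the state the heuristic looks at.
 * None of these names exist in the kernel; they are assumptions made
 * for this sketch only.
 */
struct wakeup_model {
	bool waker_is_per_cpu_kthread;	/* waker is a kthread bound to one CPU */
	int this_cpu;			/* CPU the waker is running on */
	int prev_cpu;			/* CPU the wakee last ran on */
	unsigned int nr_running;	/* runnable tasks on this_cpu, waker included */
};

/*
 * Return true if the wakee may be stacked on the waker's CPU.
 *
 * The idea: a per-cpu kthread (e.g. a kworker completing IO) is assumed
 * to be about to sleep, so a wakee that last ran on the same CPU can be
 * placed there instead of being migrated. Requiring that the kthread is
 * the only runnable task limits the damage when the guess is wrong.
 */
static bool may_stack_on_waker_cpu(const struct wakeup_model *w)
{
	return w->waker_is_per_cpu_kthread &&
	       w->prev_cpu == w->this_cpu &&
	       w->nr_running <= 1;
}
```

Note that nothing in the check depends on the caller passing a flag, which
is the property being argued for above: the wake_up_all() callers do not
need to know any of this.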

Depending on how async syscalls proceed, this might turn out to be the
wrong heuristic and an API change will be required. At least if that
happens, we'll have a few use cases to help guide what the API change
should look like so we do not end up in WF_SYNC hell again.

--
Mel Gorman
SUSE Labs