Re: [RFC PATCH] locking/percpu-rwsem: do not do lock handoff in percpu_up_write

From: Hillf Danton
Date: Tue Jan 23 2024 - 10:06:15 EST


On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@xxxxxxxxxx>
> The waitq wakeup in percpu_up_write necessarily runs the wake function
> immediately in the current thread. With it calling
> __percpu_rwsem_trylock on behalf of the thread being woken, the lock is
> extremely fair and FIFO, with the window for unfairness merely being the
> time between the release of sem->block and the completion of a relevant
> trylock.
>
> However, the woken threads that now hold the lock may not be chosen to
> run for a long time, and it would be useful to have more of this window
> available for a currently running thread to unfairly take the lock
> immediately and use it.

It makes no sense for lock acquirer to probe owner's activity except for
spining on owner. Nor for owner to guess if any acquirer comes soon.

> This can result in priority-inversion issues
> with high contention or things like CFS_BANDWIDTH quotas.

Given mutex could not avoid PI (priority-inversion) and deadlock, why is
percpu-rwsem special wrt PI?
>
> The even older version of percpu_rwsem that used an rwsem at its core
> provided some related gains in a different fashion through
> RWSEM_SPIN_ON_OWNER; while it had a similarly small window, waiting
> writers could spin, making it far more likely that a thread would hit
> this window.
>
> Signed-off-by: Ben Segall <bsegall@xxxxxxxxxx>
>
> ---
>
> So the actual problem we saw was that one job had severe slowdowns
> during startup with certain other jobs on the machine, and the slowdowns
> turned out to be some cgroup moves it did during startup. The antagonist
> jobs were spawning huge numbers of threads and some other internal bugs
> were exacerbating their contention. The lock handoff meant that a batch
> of antagonist threads would receive the read lock of
> cgroup_threadgroup_rwsem and at least some of those threads would take a
> long time to be scheduled.

If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
in rwsem_down_read_slowpath().