[RFC PATCH] locking/percpu-rwsem: do not do lock handoff in percpu_up_write

From: Benjamin Segall
Date: Mon Jan 22 2024 - 18:00:44 EST


The waitq wakeup in percpu_up_write necessarily runs the wake function
immediately in the current thread. With it calling
__percpu_rwsem_trylock on behalf of the thread being woken, the lock is
extremely fair and FIFO, with the window for unfairness merely being the
time between the release of sem->block and the completion of a relevant
trylock.

However, the woken threads that now hold the lock may not be chosen to
run for a long time, and it would be useful to have more of this window
available for a currently running thread to unfairly take the lock
immediately and use it. This can result in priority-inversion issues
with high contention or things like CFS_BANDWIDTH quotas.

The even older version of percpu_rwsem that used an rwsem at its core
provided some related gains in a different fashion through
RWSEM_SPIN_ON_OWNER; while it had a similarly small window, waiting
writers could spin, making it far more likely that a thread would hit
this window.

Signed-off-by: Ben Segall <bsegall@xxxxxxxxxx>

---

So the actual problem we saw was that one job had severe slowdowns
during startup with certain other jobs on the machine, and the slowdowns
turned out to be some cgroup moves it did during startup. The antagonist
jobs were spawning huge numbers of threads and some other internal bugs
were exacerbating their contention. The lock handoff meant that a batch
of antagonist threads would receive the read lock of
cgroup_threadgroup_rwsem and at least some of those threads would take a
long time to be scheduled.

As a totally artificial test, you can see occasional latencies up
to multiple seconds on moving singlethreaded processes with
cgroup.procs, with an antagonist cpu cgroup of thousands of cpusoakers
and tasks doing (pthread_create; pthread_join) or fork/join. A simple
antagonist is just a cpu+cpuset cgroup and:

bash -c 'enter cgroup; for i in {1..4000}; do bash -c "while true; do :; done&"; done' &
bash -c 'enter cgroup; for i in {1..4000}; do bash -c "while true; do /bin/true; done&"; done' &

favordynmods improves the average cost of the migrate, but doesn't help
much with the spikes. This patch does.

The locktorture results under the current locktorture are mixed (10m
runtime, 72 cpu machine, default settings):

With this patch:
Writes: Total: 56094 Max/Min: 922/680 Fail: 0
Reads : Total: 21522 Max/Min: 305/292 Fail: 0

Baseline:
Writes: Total: 33801 Max/Min: 472/468 Fail: 0
Reads : Total: 33649 Max/Min: 475/463 Fail: 0

Under the old locktorture I originally tested against (or patching it
back in), where torture_rwsem_write_delay basically did
"mdelay(long_hold / 10)" when it wasn't doing the long hold, the results
were more clearly beneficial (there's actual noticeable variance in this
config, so I have stats and an example run near the median):

Baseline:
Writes: Total: 11852 Max/Min: 167/164 Fail: 0
Reads : Total: 11858 Max/Min: 169/164 Fail: 0
N Min Max Median Avg Stddev
R 50 10297 15142 11690 11951.72 1437.2562
W 50 10296 15144 11692 11952.04 1437.5479

Patched:
Writes: Total: 27624 Max/Min: 442/334 Fail: 0
Reads : Total: 13957 Max/Min: 197/192 Fail: 0
N Min Max Median Avg Stddev
R 30 13725 17133 13902 14458.467 1199.125
W 30 27525 28880 27683 27844.533 372.82701


Which of these locktorture setups is a more useful test in general I
don't know, the patch removing the mdelay ("locktorture: Add long_hold
to adjust lock-hold delays") left in some of the shortdelay type
mdelay/udelay calls for other locks.

Of course, whenever proxy execution lands, the downsides of the current
nearly 100% fifo behavior will be gone.


---
kernel/locking/percpu-rwsem.c | 55 +++++++++++++++++++++--------------
1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 185bd1c906b01..251c1a7a94e40 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -112,22 +112,22 @@ static bool __percpu_rwsem_trylock(struct percpu_rw_semaphore *sem, bool reader)
*
* We use EXCLUSIVE for both readers and writers to preserve FIFO order,
* and play games with the return value to allow waking multiple readers.
*
* Specifically, we wake readers until we've woken a single writer, or until a
- * trylock fails.
+ * check of the sem->block value fails.
*/
static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
unsigned int mode, int wake_flags,
void *key)
{
bool reader = wq_entry->flags & WQ_FLAG_CUSTOM;
struct percpu_rw_semaphore *sem = key;
struct task_struct *p;

- /* concurrent against percpu_down_write(), can get stolen */
- if (!__percpu_rwsem_trylock(sem, reader))
+ /* racy, just an optimization to stop waking if the lock is taken */
+ if (atomic_read(&sem->block))
return 1;

p = get_task_struct(wq_entry->private);
list_del_init(&wq_entry->entry);
smp_store_release(&wq_entry->private, NULL);
@@ -138,32 +138,44 @@ static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
return !reader; /* wake (readers until) 1 writer */
}

static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
{
- DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
bool wait;
+ bool first = true;

- spin_lock_irq(&sem->waiters.lock);
- /*
- * Serialize against the wakeup in percpu_up_write(), if we fail
- * the trylock, the wakeup must see us on the list.
- */
- wait = !__percpu_rwsem_trylock(sem, reader);
- if (wait) {
- wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
- __add_wait_queue_entry_tail(&sem->waiters, &wq_entry);
- }
- spin_unlock_irq(&sem->waiters.lock);
+ do {
+ DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);

- while (wait) {
- set_current_state(TASK_UNINTERRUPTIBLE);
- if (!smp_load_acquire(&wq_entry.private))
- break;
- schedule();
- }
- __set_current_state(TASK_RUNNING);
+ spin_lock_irq(&sem->waiters.lock);
+ /*
+ * Serialize against the wakeup in percpu_up_write(), if we fail
+ * the trylock, the wakeup must see us on the list.
+ */
+ wait = !__percpu_rwsem_trylock(sem, reader);
+ if (wait) {
+ wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
+ /*
+ * If we wake but lose a race for the lock, preserve
+ * FIFO-ish behavior by skipping the queue
+ */
+ if (first)
+ __add_wait_queue_entry_tail(&sem->waiters, &wq_entry);
+ else
+ __add_wait_queue(&sem->waiters, &wq_entry);
+ first = false;
+ }
+ spin_unlock_irq(&sem->waiters.lock);
+
+ while (wait) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (!smp_load_acquire(&wq_entry.private))
+ break;
+ schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+ } while (wait && !__percpu_rwsem_trylock(sem, reader));
}

bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
{
if (__percpu_down_read_trylock(sem))
--
2.43.0.429.g432eaa2c6b-goog