Re: [RFC PATCH] locking/percpu-rwsem: do not do lock handoff in percpu_up_write

From: Hillf Danton
Date: Fri Jan 26 2024 - 07:43:43 EST


On Thu, 25 Jan 2024 13:08:02 -0800 Benjamin Segall <bsegall@xxxxxxxxxx>
> Hillf Danton <hdanton@xxxxxxxx> writes:
> > On Wed, 24 Jan 2024 14:10:43 -0800 Benjamin Segall <bsegall@xxxxxxxxxx>
> >> Hillf Danton <hdanton@xxxxxxxx> writes:
> >> > On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@xxxxxxxxxx>
> >> >> So the actual problem we saw was that one job had severe slowdowns
> >> >> during startup when certain other jobs were on the machine, and the
> >> >> slowdowns turned out to be caused by some cgroup moves it did during
> >> >> startup. The antagonist jobs were spawning huge numbers of threads,
> >> >> and some other internal bugs were exacerbating their contention. The
> >> >> lock handoff meant that a batch of antagonist threads would receive
> >> >> the read lock of cgroup_threadgroup_rwsem, and at least some of those
> >> >> threads would take a long time to be scheduled.
> >> >
> >> > If you want to avoid starving lock waiters, take a look at RWSEM_FLAG_HANDOFF
> >> > in rwsem_down_read_slowpath().
> >>
> >> rwsem's HANDOFF flag is the exact opposite of what this patch is doing.
> >
> > You and I are not on the same page.
> >
> >> Percpu-rwsem's current code has perfect handoff for read->write, and a very
> >> short window for write->read (or write->write) to be beaten by a new writer.
> >
> > Given that there is no spin-on-owner here, and that an owner may
> > legitimately take a ten-minute nap, the right thing to do on behalf of
> > starved waiters is to add the HANDOFF mechanism, without any heuristic
> > like the one you propose, in order to force lock acquirers down the
> > slow path.
> >
> > Just food for thought.
>
> This is not the type of slowdown my patch is trying to address. (And
> due to the way percpu-rwsem works, sem->ww is almost entirely redundant
> with sem->block - the first waiting writer instead waits on the rcuwait,
> and holds sem->block while doing so.)
>
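For reference, the writer-side path, simplified from
kernel/locking/percpu-rwsem.c (lockdep and barrier comments trimmed):
the first waiting writer takes sem->block and then parks on the
sem->writer rcuwait until the active readers drain.

	void percpu_down_write(struct percpu_rw_semaphore *sem)
	{
		/* force readers off the fast path */
		rcu_sync_enter(&sem->rss);

		/* writer-writer exclusion: take sem->block, or queue
		 * on sem->waiters until it can be taken */
		if (!__percpu_down_write_trylock(sem))
			percpu_rwsem_wait(sem, /* .reader = */ false);

		/* still holding sem->block, wait for all active
		 * readers to complete */
		rcuwait_wait_event(&sem->writer,
				   readers_active_check(sem),
				   TASK_UNINTERRUPTIBLE);
	}
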
> The problem that my patch addresses is:
>
> Writer is done: percpu_up_write
>   atomic_set_release(&sem->block, 0);                          // #1
>   wake a batch of readers:
>     percpu_rwsem_wake_function -> __percpu_rwsem_trylock(reader) // #2
>   wake a single writer:
>     percpu_rwsem_wake_function -> __percpu_rwsem_trylock(writer) // #3
> New writer wakes up (holding sem->block from #3),
>   sees the readers holding the lock from #2, now sleeps on rcuwait
> Time passes                                                    // #4
> Readers finally get to run, run quickly, and release the lock
> Now the writer gets to run
>
> Currently the only source of unfairness/optimistic locking is the window
> between #1 and #2, which occur in quick succession on the same thread,
> with no SPIN_ON_OWNER to make stealing in this window any more likely
> than it otherwise would be.
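
For concreteness, the steps above map onto the current code roughly as
follows (simplified from kernel/locking/percpu-rwsem.c):

	void percpu_up_write(struct percpu_rw_semaphore *sem)
	{
		/* #1: open the lock; from here until the wake
		 * function's trylock, anyone can steal it */
		atomic_set_release(&sem->block, 0);

		/* #2/#3: hand the lock to the waiters */
		__wake_up(&sem->waiters, TASK_NORMAL, 1, sem);

		rcu_sync_exit(&sem->rss);
	}

	static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
					      unsigned int mode, int wake_flags,
					      void *key)
	{
		bool reader = wq_entry->flags & WQ_FLAG_CUSTOM;
		struct percpu_rw_semaphore *sem = key;
		struct task_struct *p;

		/* acquire the lock on the waiter's behalf before
		 * waking it: readers at #2, a writer at #3 */
		if (!__percpu_rwsem_trylock(sem, reader))
			return 1;

		p = get_task_struct(wq_entry->private);
		list_del_init(&wq_entry->entry);
		smp_store_release(&wq_entry->private, NULL);

		wake_up_process(p);
		put_task_struct(p);

		/* keep waking readers; stop after the first writer */
		return !reader;
	}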

The sem->ww I introduced closes the window between #1 and #2 by
definition, as it is derived from rwsem's HANDOFF.
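
A minimal sketch of the idea, assuming sem->ww is a count of waiting
writers (the name is from my draft, not mainline, and this is not the
actual diff):

	static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
	{
		this_cpu_inc(*sem->read_count);
		smp_mb();

		/* a waiting writer (sem->ww != 0) vetoes new readers
		 * just as a write holder (sem->block != 0) does; the
		 * wake path at #2 would need to exempt already-queued
		 * readers from this check */
		if (likely(!atomic_read_acquire(&sem->block) &&
			   !atomic_read(&sem->ww)))
			return true;

		this_cpu_dec(*sem->read_count);
		rcuwait_wake_up(&sem->writer);
		return false;
	}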
>
> My patch makes the entire #4 available to writers (or new readers), so
> that the woken writer will instead get to run immediately. This is

Victims arise in case the woken readers at #2 have been waiting for more
than a minute while the woken writer has waited less than 20ms.

> obviously much less fair, but provides much better throughput (ideally
> it would have some sort of delay, so that in more normal circumstances
> readers don't have to win the wakeup race purely by luck and by being
> woken slightly sooner, but I don't have that).
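
If I read the RFC correctly, the wake function then stops acquiring the
lock on the waiter's behalf; the following is a sketch of that reading,
not the actual diff - the woken task must retry the acquisition itself
once it runs, so the lock stays up for grabs for all of #4:

	static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
					      unsigned int mode, int wake_flags,
					      void *key)
	{
		bool reader = wq_entry->flags & WQ_FLAG_CUSTOM;
		struct task_struct *p;

		/* no __percpu_rwsem_trylock() here: just wake the
		 * waiter and let it race for the lock in its own
		 * context (percpu_rwsem_wait() would loop on its own
		 * trylock instead of waiting for a handoff) */
		p = get_task_struct(wq_entry->private);
		list_del_init(&wq_entry->entry);
		smp_store_release(&wq_entry->private, NULL);

		wake_up_process(p);
		put_task_struct(p);

		return !reader;
	}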
>
> This is also only useful because of the assumption that readers will
> almost always not actually block (among other required assumptions) - if

Like heuristic, any assumption makes the locking game more complex than
thought without real win.

> they regularly sleep while holding the lock, then saving writers from
> that first wakeup latency of readers isn't particularly helpful anymore,
> because they'll still be delayed by the other wakeup latencies.