[PATCH-tip v6 19/22] TP-futex, doc: Update TP futexes document on shared locking

From: Waiman Long
Date: Wed Mar 22 2017 - 13:42:38 EST


The tp-futex.txt was updated to add description about shared locking
support.

Running the futex locking microbenchmark on a 2-socket 36-core E5-2699
v3 system (HT off) running on a 4.11 based kernel, the performance
of TP rwlock with 1:1 reader/writer ratio versus one based on the
wait-wake futexes as well as the Glibc rwlock with a microbenchmark
(1 worker thread per cpu core) running for 10s with a load-to-load
latency of 10 were as follows:

WW futex TP futex Glibc
-------- -------- -----
Total locking ops 30,992,772 49,786,118 14,856,400
Per-thread avg/sec 86,004 138,280 41,267
Per-thread min/sec 81,158 126,338 40,930
Per-thread max/sec 93,940 158,781 41,929
Write lock slowpaths 6,806,849 7,717,125 -
Write unlock slowpaths 8,996,339 5 -
Read lock slowpaths 2,317,716 3,883,809 -
Read unlock slowpaths 4,325,494 2 -

The throughput of TP futex as a read/write lock was about 1.6X of
the WW futex and was about 3.4X of glibc.

The following tables shows the average per-thread locking rates
(36 threads) with different reader percentages:

WW futex TP futex Glibc
-------- -------- -----
90% readers 112,868 113,925 62,321
95% readers 132,809 110,293 67,664
100% reader 152,015 167,902 79,087

With separate 18 reader and 18 writer threads, the the average
per-thread reader and writer locking rates with different load
latencies (l) and reader-preferring rwlocks were:

WW futex TP futex Glibc
-------- -------- -----
Reader rate, l=1 330,787 278,192 172,245
Writer rate, l=1 0 13,804 0

Reader rate, l=5 292,609 267,004 161,246
Writer rate, l=5 0 14,995 0

Reader rate, l=50 324,042 251,083 131,786
Writer rate, l=50 0 4,658 0

The corresponding locking rates with writer-preferring rwlocks are:

WW futex TP futex Glibc
-------- -------- -----
Reader rate, L=1 118,959 276,180 68
Writer rate, L=1 24,966 24,739 75,969

Reader rate, L=5 123,325 280,492 81
Writer rate, L=5 28,328 21,874 82,331

Reader rate, L=50 88,673 280,288 4
Writer rate, L=50 20,442 5,816 78,522

With both the WW futex and Glibc rwlocks, lock starvation happened
for the writers with reader-preferring rwlocks. For writer preferring
rwlocks, the WW futex one fared better. The Glibc one, however,
was close to starving the readers.

Lock starvation should not happen on the TP futexes as long as the
underlying kernel mutex is lock starvation free which is the case
for 4.10 and later kernels.

On a 2-socket 10-core and 80-thread Power8 system, the benchmark
results were:
WW futex TP futex Glibc
-------- -------- -----
Total locking ops 8,713,156 18,897,676 4,707,780
Per-thread avg/sec 10,888 23,598 5,884
Per-thread min/sec 9,097 22,028 5,763
Per-thread max/sec 13,106 24,208 6,007
Write lock slowpaths 2,100,149 4,078,698 -
Write unlock slowpaths 2,324,093 6 -
Read lock slowpaths 500,372 2,160,265 -
Read unlock slowpaths 1,210,698 1 -

In this case, the TP futex is more than 2X the performance of the
WW futex.

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
Documentation/tp-futex.txt | 170 +++++++++++++++++++++++++++++++++++++++------
1 file changed, 150 insertions(+), 20 deletions(-)

diff --git a/Documentation/tp-futex.txt b/Documentation/tp-futex.txt
index d040c18..5c781a5 100644
--- a/Documentation/tp-futex.txt
+++ b/Documentation/tp-futex.txt
@@ -80,22 +80,62 @@ Userspace locking can improve throughput by reducing lock hold time,
but it also has the risk of lock starvation. So kernel locking should
be used after a number of userspace locking failures.

-Inside the kernel, a kernel mutex is used for serialization among
-the futex waiters. Only the top lock waiter which is the owner of
-the serialization mutex is allowed to continuously spin and attempt
-to acquire the lock. Other lock waiters will have one attempt to
-steal the lock before entering the mutex queues.
+A shared lock acquirer, on the other hand, has to do either one of
+the following 2 actions depends on the current value of the futex:
+
+ 1) If current futex value is 0, atomically put (FUTEX_SHARED + 1)
+ into the lower 30 bits of the futex.
+
+ 2) If the FUTEX_SHARED bit is set and the FUTEX_SHARED_UNLOCK bit
+ isn't, atomically increment the reader count in the lower 24 bits.
+ The other unused bits (24-27) can be used for other management purpose
+ and will be ignored by the kernel.
+
+Any failure to both of the above actions will lead to the following
+new FUTEX_LOCK_SHARED futex(2) syscall.
+
+ futex(uaddr, FUTEX_LOCK_SHARED, prefer_reader, timeout, NULL, 0);
+
+The special FUTEX_SHARED bit (bit 30) is used to distinguish between a
+shared lock and an exclusive lock. If the FUTEX_SHARED bit isn't set,
+the lower 29 bits contain the TID of the exclusive lock owner. If
+the FUTEX_SHARED bit is set, the lower 29 bits contain the number of
+tasks owning the shared lock. Therefore, only 29 bits are allowed
+to be used by the TID.
+
+The FUTEX_SHARED_UNLOCK bit can be used by userspace to prevent racing
+and to indicate that a shared unlock is in progress as it will disable
+any shared trylock attempt in the kernel.
+
+The prefer_reader flag is used to control if the lock acquisition
+should be done in a prefer reader mode (if set) or a neutral mode
+where there is no preference to either reader or writer.
+
+Inside the kernel, a kernel mutex is used for serialization among the
+futex waiters. Only the top lock waiter which is the owner of the
+serialization mutex is allowed to continuously spin and attempt to
+acquire the lock. Other lock waiters will have one or two attempts
+to steal the lock before entering the mutex queues.

When the exclusive futex lock owner is no longer running, the top
waiter will set the FUTEX_WAITERS bit before going to sleep. This is
to make sure the futex owner will go into the kernel at unlock time
to wake up the top waiter.

+When the futex is shared by multiple owners, there is no way to
+determine if all the shared owners are running or not. In this case,
+the top waiter will spin for a fixed amount of time if no reader
+activity is observed before giving up and going to sleep. Any change
+in the reader count will be considered reader activities and so will
+reset the timer. However, when the hand-off threshold has been reached,
+reader optimistic spinning timer reset will be disabled automatically.
+
The return values of the above futex locking syscall, if non-negative,
-are status code that consists of 2 fields - the lock acquisition code
-(bits 0-7) and the number of sleeps (bits 8-30) in the optimistic
-spinning loop before acquiring the futex. A negative returned value
-means an error has happened.
+are status code that consists of 3 fields - the lock acquisition code
+(bits 0-7) and the number of sleeps (bits 16-30) in the optimistic
+spinning loop before acquiring the futex. Bits 8-15 are reserved
+for future extension. A negative returned value means an error has
+happened.

The lock acquisition code can have the following values:
a) 0 - lock stolen as non-top waiter
@@ -108,14 +148,21 @@ issue a FUTEX_UNLOCK futex(2) syscall to wake up the top waiter.

futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0);

-A return value of 1 from the FUTEX_UNLOCK futex(2) syscall indicates
-a task has been woken up. The syscall returns 0 if no sleeping task
-is woken. A negative value will be returned if an error happens.
+For a shared lock owner, unlock is done by atomically decrementing
+the reader count by one. If the resultant futex value has no reader
+remaining, the process has to atomically change the futex value from
+FUTEX_SHARED to 0. If that fails, it has to issue a FUTEX_UNLOCK_SHARED
+futex(2) syscall to wake up the top waiter.
+
+A return value of 1 from the FUTEX_UNLOCK or FUTEX_UNLOCK_SHARED
+futex(2) syscall indicates a task has been woken up. The syscall
+returns 0 if no sleeping task is woken. A negative value will be
+returned if an error happens.

-The error number returned by a FUTEX_UNLOCK syscall on an empty futex
-can be used to decide if the TP futex functionality is implemented
-in the kernel. If it is present, an EPERFM error will be returned.
-Otherwise it will return ENOSYS.
+The error number returned by a FUTEX_UNLOCK or FUTEX_UNLOCK_SHARED
+syscall on an empty futex can be used to decide if the TP futex
+functionality is implemented in the kernel. If it is present, an
+EPERFM error will be returned. Otherwise it will return ENOSYS.

TP futexes require the kernel to have SMP support as well as support
for the cmpxchg functionality. For architectures that don't support
@@ -137,6 +184,13 @@ situation. It is possible for the kernel to handle this internally,
but it will probably reduce performance. Userspace code can handle
this more efficiently.

+Shared locking TP futexes, on the other hand, cannot be used with
+robust futexes. The unexpected death of a shared lock owner may
+leave the futex permanently in the shared state leading to deadlock
+of exclusive lock waiters. If necessary, some kind of timeout and
+userspace lock management system have to be used to resolve this
+deadlock problem.
+
Usage Scenario
--------------

@@ -148,6 +202,24 @@ used when the critical section is long and prone to sleeping. However,
it may not have the performance gain when compared with a wait-wake
futex in this case.

+A TP futex can also be used to implement a userspace reader-writer
+lock (rwlock) where the writers use the FUTEX_LOCK and FUTEX_UNLOCK
+futex(2) syscalls and the readers used the FUTEX_LOCK_SHARED and
+FUTEX_UNLOCK_SHARED futex(2) syscalls. Beside providing a better
+performance compared with wait-wake futexes, other advantages of
+using the the TP futexes for rwlocks are:
+
+ 1) The simplicity and ease of maintenance of the userspace locking
+ codes. There is only one version of rwlock using TP futex versus
+ a reader-preferring variant and/or a writer-preferring variant
+ when using the wait-wake futexes.
+
+ 2) TP futex is lock-starvation free. That may not be the case
+ with rwlocks implemented with wait-wake futexes especially the
+ reader-preferring variants where the writers can be starved of
+ the lock under certain circumstances. Some writer-preferring
+ rwlock variants may also starve readers of getting the lock.
+
The wait-wake futexes are more versatile as they can also be used to
implement other locking primitives like semaphores or conditional
variables. So the TP futex is not a direct replacement of the
@@ -157,12 +229,12 @@ futex is likely a better option than the wait-wake futex.
Sample Code
-----------

-The following are sample code to implement simple mutex lock and
-unlock functions.
+The following are sample code to implement simple mutex or writer
+lock and unlock functions.

__thread int thread_id;

-void mutex_lock(int *faddr)
+void write_lock(int *faddr)
{
if (cmpxchg(faddr, 0, thread_id) == 0)
return;
@@ -170,7 +242,7 @@ void mutex_lock(int *faddr)
;
}

-void mutex_unlock(int *faddr)
+void write_unlock(int *faddr)
{
int old, fval;

@@ -178,3 +250,61 @@ void mutex_unlock(int *faddr)
return;
futex(faddr, FUTEX_UNLOCK, 0, NULL, NULL, 0);
}
+
+The following sample code can be used to implement shared or reader
+lock and unlock functions.
+
+void read_lock(int *faddr)
+{
+ int val = cmpxchg(faddr, 0, FUTEX_SHARED + 1);
+
+ if (!val)
+ return;
+ for (;;) {
+ int old = val, new;
+
+ if (!old)
+ new = FUTEX_SHARED + 1;
+ else if (!(old & FUTEX_SHARED) ||
+ (old & ((~FUTEX_SHARED_TID_MASK)|
+ FUTEX_SHARED_UNLOCK)))
+ break;
+ else
+ new = old + 1;
+ val = cmpxchg(faddr, old, new);
+ if (val == old)
+ return;
+ }
+ while (futex(faddr, FUTEX_LOCK_SHARED, ...) < 0)
+ ;
+}
+
+void read_unlock(int *faddr)
+{
+ int val = xadd(faddr, -1) - 1;
+ int old;
+
+ for (;;) {
+ /*
+ * Return if not the last reader, not in shared locking
+ * mode or the unlock bit has been set.
+ */
+ if (!(val & FUTEX_SHARED) ||
+ (val & (FUTEX_SHARED_UNLOCK|FUTEX_SCNT_MASK)))
+ return;
+ if (val & ~FUTEX_SHARED_TID_MASK) {
+ old = cmpxchg(faddr, val, val|FUTEX_SHARED_UNLOCK);
+ if (old == val)
+ break; /* Do FUTEX_UNLOCK_SHARED */
+ } else {
+ old = cmpxchg(faddr, val, 0);
+ if (old == val)
+ return;
+ }
+ val = old;
+ }
+ futex(faddr, FUTEX_UNLOCK_SHARED, ...);
+}
+
+More elaborate sample locking codes for the TP futexes are also
+available at the tools/perf/bench/futex-locks.c file.
--
1.8.3.1