Re: [patch] waitqueue optimization, 2.4.0-test7

From: Andrea Arcangeli (
Date: Tue Aug 29 2000 - 09:03:32 EST

On Tue, 29 Aug 2000, David S. Miller wrote:

> Date: Tue, 29 Aug 2000 04:26:18 +0200 (CEST)
> From: Andrea Arcangeli <>
> Alpha definitely needs mb(), ld_l/st_c doesn't imply any memory
> barrier (yes, alpha is very aggressive in SMP 8).
>I wonder if it then makes sense to just provide a set of bitops
>interfaces then which enforces the barrier. Or perhaps the other way

And where do you want to enforce the barrier? Before or after clear_bit,
or both? Do you think we should add mb_clear_bit() and clear_bit_mb() and
mb_clear_bit_mb()? I'm not sure that's cleaner than using mb();
clear_bit(); mb(), however I see that only alpha due its SMP capability
would get the explicit mb() at zero additinal cost. Oh well...

Anyway what we call now clear_bit() should not have any implicit memory
barrier. clear_bit is just an atomic function, it doesn't have anything to
do with memory barriers. (I dislike the __clear_bit approch)

I'd prefer to have things like UnlockPage implemented as the architecture
prefers (with a safe SMP common code default) instead of exporting zillons
of mb_clear_bit_mb and friends variant (same for set_bit and other
friends). Maybe the architecture can do even more deep assumption than
what we can export via an API bloating mb_clear_bit_mb().

Note that actually we just now enforce an implicit memory barrier _after_
test_and_set/clear/change_bit (to be able to use them like spinlocks to
build a critical section). For example mm/filemap.c:lock_page() relys on
test_and_set_bit (in TryLockPage) to do a mb() _after_ setting the
PG_locked bitflag and of course the alpha port is doing it see:

        __asm__ __volatile__(
        "1: ldl_l %0,%1\n"
        " and %0,%3,%2\n"
        " bne %2,2f\n"
        " xor %0,%3,%0\n"
        " stl_c %0,%1\n"
        " beq %0,3f\n"
        " mb\n"
        ".subsection 2\n"
        "3: br 1b\n"
        :"=&r" (temp), "=m" (*m), "=&r" (oldbit)
        :"Ir" (1UL << (nr & 31)), "m" (*m));

And surprise: I found a SMP race in UnlockPage. What we're doing now it's
of the kind of:

        long lock = 0; (page->flags)
        while (test_and_set_bit(&lock, 0)) (TryLockPage())

        -- critical section --

        lock = 0; (clear_bit())

And the above will allow the `lock = 0' (in our case clear_bit) to be
reordered inside the critical section causing a plain SMP race. To fix it
we do need an explicit mb() _before_ clear_bit() (and after the critical
section of course). And for using the wake_up optimization
(waitqueue_active()) we need mb() _after_ clear_bit(). Fun ;). So we need
both before and after for two very different reasons (first mb is for
closing the critical section before releasing the lock, second mb is for
ordering rules of the wait event interface :)

Furthmore what we need aren't really mb() but smb_mb() that are noops in
an UP compilation.

Talking about common code the code I'd prefer to write is this:

lock_page() {
        long lock = 0;
        while (TryLockPAge(page)) /* test_and_set_bit doesn't include implicit mb() */

        -- critical section --

UnlockPage() {
        if (waitqueue_active())

The above would fix the SMP race and it would be the most finegrined
implementation that also avoids suprious mb() in the slow path of
lock_page(). It would be also the most efficient implementation for the
alpha :)) IMHO it also tends to be less misleading since there will be a
smb_mb() at the start of the critical section too so we won't forgot of
the second _necessary_ one. And we won't do a suprious mb while doing the
test_and_set_bit(PG_referenced) that doesn't need the mb.

However with the above common code to optimize lock_page we should
re-implemented per-arch (ala string.h with overriding #define) for IA32
and sparc64 to avoid wasting cycles in suprious mb() that happen to be
implicit in the bitops at least for such two architectures.

I'm checking if it's not a pain to audit all the
test_and_set/clear/change_bit to avoid adding the implicit barrier in them
and move the stuff that we want to be optimized in the arch section.
Meanwhile hints and comments are welcome.

If you prefer a zillon of mb_clear_bit_mb and all other possible
combinations we can do that of course too (optimizing the slow path of
lock_page shouldn't be very worthwhile optimization, I see).

BTW, (changing topic a bit) after your lesson about the meaning of sparc64
membar I seen you never use the rmb/wmb inside the sparc64 code so I think
you can just optimize sparc64 to be more finegrined (and yes wmb should
have the alpha asm("wmb") meaning :) and rmb should be the exact
counterpart to avoid obvious mistakes where nobody knows anymore what such
two macros means :).

--- 2.4.0-test5/include/asm-sparc64/system.h Sat Jun 24 16:03:02 2000
+++ /tmp/system.h Fri Jul 28 14:23:09 2000
@@ -100,8 +100,8 @@
 #define nop() __asm__ __volatile__ ("nop")
 #define membar(type) __asm__ __volatile__ ("membar " type : : : "memory");
-#define rmb() membar("#LoadLoad | #LoadStore")
-#define wmb() membar("#StoreLoad | #StoreStore")
+#define rmb() membar("#LoadLoad")
+#define wmb() membar("#StoreStore")
 #define set_mb(__var, __value) \
         do { __var = __value; membar("#StoreLoad | #StoreStore"); } while(0)
 #define set_rmb(__var, __value) \

And this changes brlock.h accordingly so that it's implemented in the
alpha rmb/wmb meanings:

--- 2.4.0-test7-pre5/include/linux/brlock.h Tue Aug 22 01:30:04 2000
+++ /tmp/brlock.h Tue Aug 22 01:56:53 2000
@@ -114,10 +114,15 @@
         lock = &__br_write_locks[idx].lock;
- rmb();
+ mb();
         if (spin_is_locked(lock)) {
- rmb();
+ wmb(); /*
+ * The release of the ctr must become visible
+ * to the other cpus eventually thus wmb(),
+ * we don't care if spin_is_locked is reordered
+ * before the releasing of the ctr.
+ */
                 while (spin_is_locked(lock))
                 goto again;

(btw you and Alan previously said me that the wmb() is strictly necessary
(Alan said at least in theory :) but now I well remebered why I said wmb
isn't necessary, that's because spinlocks would otherwise deadlock at
least on alpha because in spin_unlock() we do:

        lock = 0;
        /* none wmb() */

Now assume after the spin_unlock we just infinite loop:

        for (;;)

so never causing the write buffer to be full (the write buffer have only
the `lock' memory set to zero in it).

and the other task in the slow path is doing:

        while (test_and_set_bit(&lock))
                while (lock); /* other task is looping here */

(if the lock set to zero by the other CPU running spin_unlock would never
become visible this CPU would deadlock in the slow path of the spin_lock)

I guess/hope we don't rely on the random stuff after spin_unlock to cause
the write buffer to be flushed because of lots of writes in memory :)

I always read wmb/rmb are there to avoid reodering, that looked a matter
of ordering of visibility only, not of happening or not happening.

Comments on this?


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
Please read the FAQ at

This archive was generated by hypermail 2b29 : Thu Aug 31 2000 - 21:00:23 EST