Hi Waiman,
I'm not subscribed to LKML, so I'm writing to you directly. In your patch you put an smp_wmb() at the end of queue_spin_unlock():
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ /*
+ * This unlock function may get inlined into the caller. The
+ * compiler barrier is needed to make sure that the lock will
+ * not be released before changes to the critical section is done.
+ * The ending smp_wmb() allows queue head process to get the lock ASAP.
+ */
+ barrier();
+#ifndef QSPINLOCK_MANYCPUS
+ {
+ u16 qcode = ACCESS_ONCE(lock->locked);
+ if (unlikely(qcode != QSPINLOCK_LOCKED))
+ queue_spin_unlock_slowpath(lock, qcode);
+ }
+#endif
+ ACCESS_ONCE(lock->locked) = 0;
+ smp_wmb();
+}
Isn't a race possible if another thread acquires the spinlock in the window between the store that sets lock->locked to 0 and the smp_wmb()? Writes from our thread's critical section might be delayed behind the write to lock->locked if the corresponding cache bank is busy.
And shouldn't it be a full memory barrier (smp_mb()) instead? We want loads from the critical section to complete as well before handing the spinlock to another thread, which might modify the memory we are still reading.
The comment in queue_spin_unlock() says: "The ending smp_wmb() allows queue head process to get the lock ASAP". But memory barriers can be costly on some architectures, so issuing one just to flush the write buffers may cause a performance loss there. Such things should be left to the hardware, IMHO.
So I think queue_spin_unlock() should look like this:
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ /*
+ * Wait for the critical section to finish memory accesses.
+ */
+ smp_mb();
+
+#ifndef QSPINLOCK_MANYCPUS
+ {
+ u16 qcode = ACCESS_ONCE(lock->locked);
+ if (unlikely(qcode != QSPINLOCK_LOCKED))
+ queue_spin_unlock_slowpath(lock, qcode);
+ }
+#endif
+ ACCESS_ONCE(lock->locked) = 0;
+}
Also, I don't understand why there are so many uses of the ACCESS_ONCE() macro. It does not guarantee memory ordering with respect to other CPUs, so most of the uses can probably be removed (with the exception of lock_is_contended(), where it prevents gcc from optimizing the load out of the loop).