Re: [PATCH 8/7] net/netfilter/nf_conntrack_core: Remove another memory barrier

From: Manfred Spraul
Date: Mon Sep 05 2016 - 14:57:33 EST


Hi Peter,

On 09/02/2016 09:22 PM, Peter Zijlstra wrote:
On Fri, Sep 02, 2016 at 08:35:55AM +0200, Manfred Spraul wrote:
On 09/01/2016 06:41 PM, Peter Zijlstra wrote:
On Thu, Sep 01, 2016 at 04:30:39PM +0100, Will Deacon wrote:
On Thu, Sep 01, 2016 at 05:27:52PM +0200, Manfred Spraul wrote:
Since spin_unlock_wait() is defined as equivalent to spin_lock();
spin_unlock(), the memory barrier before spin_unlock_wait() is
also not required.
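
(For illustration only -- this is not the actual nf_conntrack_core.c hunk, and the names below are made up. It just shows the shape of the pattern the patch description is talking about: a store published before waiting on another CPU's lock, with the barrier in question sitting between the two.)

#include <linux/spinlock.h>
#include <linux/compiler.h>

static DEFINE_SPINLOCK(bucket_lock);
static int flag;

static void publish_and_wait(void)
{
	WRITE_ONCE(flag, 1);
	/*
	 * smp_mb();
	 *
	 * The barrier the patch removes, on the theory that
	 * spin_unlock_wait() == spin_lock(); spin_unlock()
	 * already orders the store above against the wait below.
	 */
	spin_unlock_wait(&bucket_lock);
}
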
Note that ACQUIRE+RELEASE isn't a barrier.

Both are semi-permeable and things can cross in the middle, like:


x = 1;
LOCK
UNLOCK
r = y;

can (validly) get re-ordered like:

LOCK
r = y;
x = 1;
UNLOCK

So if you want things ordered, as I think you do, I think the smp_mb()
is still needed.
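
(Illustrative only, extending the fragment above rather than quoting any patch: with a full barrier between the store and the critical section, the two accesses can no longer cross.)

	x = 1;
	smp_mb();	/* orders the store to x against the load of y;
			 * the LOCK+UNLOCK pair alone does not */
	LOCK
	UNLOCK
	r = y;
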
CPU1:
x=1; /* without WRITE_ONCE */
LOCK(l);
UNLOCK(l);
<do_semop>
smp_store_release(&x, 0);


CPU2:
LOCK(l);
if (smp_load_acquire(&x) == 1) goto slow_path;
<do_semop>
UNLOCK(l);

Ordering is enforced because both CPUs access the same lock.

x=1 can't be reordered past the UNLOCK(l), so I don't see that further
guarantees are necessary.

Correct?
Correct, sadly implementations do not comply :/ In fact, even x86 is
broken here.

I spoke to Will earlier today and he suggests either making
spin_unlock_wait() stronger to avoid any and all such surprises or just
getting rid of the thing.
I've tried the trivial solution:
Replace spin_unlock_wait() with spin_lock(); spin_unlock().
With sem-scalebench, I get around a factor-2 slowdown with an array of 16 semaphores and a factor-13 slowdown with an array of 256 semaphores :-(
[with LOCKDEP+DEBUG_SPINLOCK].
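
The replacement looks roughly like this (sketch only, not the actual diff; the loop and field names follow the per-semaphore wait loop in ipc/sem.c from memory and may not match the current tree exactly):

	/* was: smp_mb(); followed by a loop of spin_unlock_wait() calls */
	for (i = 0; i < sma->sem_nsems; i++) {
		spinlock_t *p = &sma->sem_base[i].lock;

		spin_lock(p);	/* replaces spin_unlock_wait(p) */
		spin_unlock(p);
	}
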

Anyone around with a ppc or arm? How slow is the loop of spin_unlock_wait() calls?
Single CPU is sufficient.

Question 1: How large is the difference between:
#./sem-scalebench -t 10 -c 1 -p 1 -o 4 -f -d 1
#./sem-scalebench -t 10 -c 1 -p 1 -o 4 -f -d 256
https://github.com/manfred-colorfu/ipcscale

For x86, the difference is only ~30%.

Question 2:
Is it faster if the attached patch is applied? (relative to mmots)

--
Manfred