Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Addcmpxchg support for ARMv6+ systems)

From: Mathieu Desnoyers
Date: Tue May 26 2009 - 13:23:36 EST


* Russell King - ARM Linux (linux@xxxxxxxxxxxxxxxx) wrote:
> On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote:
> > > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
> > > index bd4dc8e..e9889c2 100644
> > > --- a/arch/arm/include/asm/system.h
> > > +++ b/arch/arm/include/asm/system.h
> > > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
> > > unsigned int tmp;
> > > #endif
> > >
> > > + smp_mb();
> > > +
> > > switch (size) {
> > > #if __LINUX_ARM_ARCH__ >= 6
> > > case 1:
> > > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
> > > " bne 1b"
> > > : "=&r" (ret), "=&r" (tmp)
> > > : "r" (x), "r" (ptr)
> > > - : "memory", "cc");
> > > + : "cc");
> >
> > I would not remove the "memory" constraint in here. Anyway it's just a
> > compiler barrier, I doubt it would make anyting faster (due to the
> > following smp_mb() added), but it surely makes things harder to
> > understand.
>
> If you don't already know that smp_mb() is always a compiler barrier then
> I guess it's true, but then if you don't know what the barriers are defined
> to be, should you be trying to understand atomic ops?
>

The "memory" constaint in a gcc inline assembly has always been there to
tell the compiler that the inline assembly has side-effects on memory so
the compiler can take the appropriate decisions wrt optimisations.

Removing the "memory" clobber is just a bug.

The compiler needs the "memory" clobber in the inline assembly to know
that it cannot be put outside of the smp_mb() "memory" clobbers. If you
remove this clobber, the compiler is free to move the inline asm outside
of the smp_mb()s as long as it only touches registers and reads
immediate values.

> > If we determine that some reorderings are not possible on ARM, we might
> > eventually change rmb() for a simple barrier() and make atomic op
> > barriers a bit more lightweight.
>
> There seems to be no "read memory barrier" on ARM - the dmb instruction
> can be restricted to writes, but not just reads.
>

Yes, I've noticed that too. BTW binutils seems broken and does not
support "dmb st" :-(

> Moreover, there's a comment in the architecture reference manual that
> implementations which don't provide the other dmb variants must default
> to doing the full dmb, but then goes on to say that programs should not
> rely on this (what... that the relaxed dmb()s have any effect what so
> ever?)

Whoah.. that sounds odd.

>
> Suggest we don't go anywhere near those until ARMv7 stuff has matured
> for a few years and these details resolved.

I think we have more insteresting details in the following reference
which tells us we are in the right direction. I wonder about
smp_read_barrier_depends though. Maybe we should add a dmb in there
given how permissive the ordering semantic seems to be.

ARM v7-M Architecture Application Level Reference

http://jedrzej.ulasiewicz.staff.iiar.pwr.wroc.pl/KomputeroweSystSter/seminarium/materialy/arm/ARMv7_Ref.pdf

"A3.4.6 Synchronization primitives and the memory order model

The synchronization primitives follow the memory ordering model of the
memory type accessed by the instructions. For this reason:

â Portable code for claiming a spinlock must include a DMB instruction
between claiming the spinlock and making any access that makes use of
the spinlock.
â Portable code for releasing a spinlock must include a DMB instruction
before writing to clear the spinlock.

This requirement applies to code using the
Load-Exclusive/Store-Exclusive instruction pairs, for example
LDREX/STREX."

(this means that adding the dmb as we are doing is required)


A3.7.3 - Ordering requirements for memory accesses

Reading this, especially the table detailing the "Normal vs Normal"
memory operation order, makes me wonder if we should map

#define smp_read_barrier_depends dmb()

Because two consecutive reads are not guaranteed to be globally
observable in program order. This means :

cpu A

write data
wmb()
update ptr

cpu B

cpyptr = rcu_dereference(ptr);
if (cpyptr)
access *cpyptr data

cpu B could see the new ptr cache-line before the data cache-line, which
means we can read garbage.

But currently, only the old alphas have this requirement. I wonder if
it's just me missing some very important detail, but the documentation
about ordering seems permissive enough that it could potentially allow
such reordering.

Mathieu


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/