Re: single copy atomicity for double load/stores on 32-bit systems

From: Vineet Gupta
Date: Mon Jul 01 2019 - 16:05:58 EST


On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> On Thu, May 30, 2019 at 11:22:42AM -0700, Vineet Gupta wrote:
>> Hi Peter,
>>
>> Had an interesting lunch time discussion with our hardware architects pertinent to
>> "minimal guarantees expected of a CPU" section of memory-barriers.txt
>>
>>
>> | (*) These guarantees apply only to properly aligned and sized scalar
>> | variables. "Properly sized" currently means variables that are
>> | the same size as "char", "short", "int" and "long". "Properly
>> | aligned" means the natural alignment, thus no constraints for
>> | "char", two-byte alignment for "short", four-byte alignment for
>> | "int", and either four-byte or eight-byte alignment for "long",
>> | on 32-bit and 64-bit systems, respectively.
>>
>>
>> I'm not sure how to interpret "natural alignment" for the case of double
>> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
>> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
>
> Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
>
> For any u64 type, that would give 8 byte alignment. the problem
> otherwise being that your data spans two lines/pages etc..
>
>> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
>> be atomic unless 8-byte aligned
>>
>> ARMv7 arch ref manual seems to confirm this. Quoting
>>
>> | LDM, LDC, LDC2, LDRD, STM, STC, STC2, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR,
>> | VSTM, and VSTR instructions are executed as a sequence of word-aligned word
>> | accesses. Each 32-bit word access is guaranteed to be single-copy atomic. A
>> | subsequence of two or more word accesses from the sequence might not exhibit
>> | single-copy atomicity
>>
>> While it seems reasonable from hardware pov to not implement such atomicity by
>> default it seems there's an additional burden on application writers. They could
>> be happily using a lockless algorithm with just a shared flag between 2 threads
>> w/o need for any explicit synchronization.
>
> If you're that careless with lockless code, you deserve all the pain you
> get.
>
>> But upgrade to a new compiler which
>> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
>> causing the code to suddenly stop working. Is the onus on them to declare such
>> memory as c11 atomic or some such.
>
> When a programmer wants guarantees they already need to know wth they're
> doing.
>
> And I'll stand by my earlier conviction that any architecture that has a
> native u64 (be it a 64bit arch or a 32bit with double-width
> instructions) but has an ABI that allows u32 alignment on them is daft.

So I agree with Paul's assertion that it is strange for an 8-byte type to be 4-byte
aligned on a 64-bit system, but is it totally broken even if the ISA of the said
64-bit arch allows LD/ST to be augmented with acq/rel respectively?

Say the ISA guarantees single-copy atomicity only for the aligned cases (i.e. for
8-byte data only if it is naturally aligned), and in the absence of that guarantee
the programmer needs to use the proper acq/release.
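
To make the aligned vs. unaligned distinction concrete, here is a rough sketch
in C of the natural-alignment test quoted earlier in the thread, applied to a
64-bit member. The struct and helper names are hypothetical, and the
packed/aligned attributes are GCC/Clang extensions, so treat this as an
illustration under those assumptions and not as anything a particular ABI
mandates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Natural alignment as defined above: the address is a multiple of the size. */
static int naturally_aligned(uintptr_t addr, size_t size)
{
	return (addr % size) == 0;
}

/* Hypothetical layout: 'packed' (GCC/Clang extension) puts the u64 at offset 4. */
struct packed_msg {
	uint32_t flag;
	uint64_t data;
} __attribute__((packed));

/* Same fields, with the u64 explicitly forced to 8-byte alignment. */
struct aligned_msg {
	uint32_t flag;
	uint64_t data __attribute__((aligned(8)));
};

/* Compile-time check that the forced alignment really moved 'data' to offset 8. */
static_assert(offsetof(struct aligned_msg, data) == 8, "data should sit at offset 8");

/* Returns 1 if 'data' is naturally aligned within an 8-byte-aligned instance. */
static int packed_data_ok(void)
{
	return naturally_aligned(offsetof(struct packed_msg, data), sizeof(uint64_t));
}

static int aligned_data_ok(void)
{
	return naturally_aligned(offsetof(struct aligned_msg, data), sizeof(uint64_t));
}
```

With these definitions packed_data_ok() returns 0 and aligned_data_ok() returns
1, i.e. only the second layout would fall under a "naturally aligned" atomicity
guarantee.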

In my earlier example on lockless code, we do assume that the programmer will use a
release when updating the flag.
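
A minimal sketch of that flag pattern, written with C11 atomics instead of
kernel primitives (an assumption made only to keep the example self-contained;
struct mailbox and the function names are hypothetical). The 64-bit payload is
a plain variable that may well be stored as two word accesses; the reader only
touches it after observing the flag with acquire semantics, so the pattern does
not rely on the payload store itself being single-copy atomic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared structure between one writer and one reader. */
struct mailbox {
	uint64_t payload;	/* plain data, possibly written as two 32-bit words */
	atomic_int ready;	/* shared flag */
};

static void mailbox_init(struct mailbox *m)
{
	m->payload = 0;
	atomic_init(&m->ready, 0);
}

/* Writer: fill in the payload, then publish it with a release store. */
static void publish(struct mailbox *m, uint64_t v)
{
	m->payload = v;
	atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/*
 * Reader: acquire-load the flag; read the payload only if it is set.
 * Returns 1 and fills *out on success, 0 if nothing has been published.
 */
static int try_consume(struct mailbox *m, uint64_t *out)
{
	if (!atomic_load_explicit(&m->ready, memory_order_acquire))
		return 0;
	*out = m->payload;
	return 1;
}
```

Here the writer thread calls publish() once and the reader polls try_consume();
the release/acquire pairing on the flag is what orders the two halves of the
payload store before the reader's load, whatever the alignment of payload.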