RE: [PATCH v4] x86: use builtins to read eflags

From: David Laight
Date: Fri Feb 11 2022 - 17:10:06 EST


From: Bill Wendling
> Sent: 11 February 2022 19:26
>
> On Fri, Feb 11, 2022 at 8:40 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
> > From: Bill Wendling
> > > Sent: 10 February 2022 22:32
> > >
> > > GCC and Clang both have builtins to read and write the EFLAGS register.
> > > This allows the compiler to determine the best way to generate this
> > > code, which can improve code generation.
> > >
> > > This issue arose due to Clang's issue with the "=rm" constraint. Clang
> > > chooses to be conservative in these situations, and so uses memory
> > > instead of registers. This is a known issue, which is currently being
> > > addressed.
> > >
> > > However, using builtins is beneficial in general, because it removes the
> > > burden of determining what's the way to read the flags register from the
> > > programmer and places it on to the compiler, which has the information
> > > needed to make that decision.
> >
> > Except that neither gcc nor clang attempt to make that decision.
> > They always do pushf; pop ax;
> >
> It looks like both GCC and Clang pop into virtual registers. The
> register allocator is then able to determine if it can allocate a
> physical register or if a stack slot is required.

Doing:
int fl;
void f(void) { fl = __builtin_ia32_readeflags_u64(); }
Seems to use a register.
If it pops to a virtual register it will probably never pop
into a real target location.

See https://godbolt.org/z/8aY8o8rhe
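
A minimal sketch for comparing the two cases in compiler output, assuming
x86-64 and a GCC or Clang that provides the builtin. The names saved_flags,
store_flags() and return_flags() are made up for illustration:

```c
/* Hypothetical global, used only to give the builtin a memory destination. */
unsigned long long saved_flags;

/* The destination is memory: check whether the compiler emits
 * 'pushf; pop mem' or 'pushf; pop reg; mov reg, mem'. */
void store_flags(void)
{
	saved_flags = __builtin_ia32_readeflags_u64();
}

/* The destination is the return register, so a plain 'pushf; pop reg'
 * is the expected result. */
unsigned long long return_flags(void)
{
	return __builtin_ia32_readeflags_u64();
}
```

Building with -O2 -S (or pasting into godbolt) shows which form each
compiler version picks for the store.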

But performance-wise the pop+mov is just one byte longer.
Instruction decode time might be higher for two instructions, but since
'pop mem' generates 2 uops (Intel) it may be constrained to the first
decoder (I can't remember the exact details), whereas the separate pop+mov
can be decoded in parallel - so could end up faster.

Actual execution time (if that makes any sense) is really the same.
Two operations, one pop and one memory write.

I bet you'd be hard pressed to find a piece of code where it even made
a consistent difference.

> > ...
> > > v4: - Clang now no longer generates stack frames when using these builtins.
> > > - Corrected misspellings.
> >
> > While clang 'head' has been fixed, it seems a bit premature to say
> > it is 'fixed' enough for all clang builds to use the builtin.
> >
> True, but it's been cherry-picked into the clang 14.0.0 branch, which
> is scheduled for release in March.
>
> > Seems better to change it (back) to "=r" and comment that this
> > is currently as good as __builtin_ia32_readeflags_u64() and that
> > clang makes a 'pigs breakfast' of "=rm" - which has only marginal
> > benefit.
> >
> That would be okay as far as code generation is concerned, but it does
> place the burden of correctness back on the programmer. Also, it was
> that at some point, but was changed to "=rm" here. :-)

As I said, a comment should stop the bounce.

...
> I was able to come up with an example where GCC generates "pushf ; pop mem":
>
> https://godbolt.org/z/9rocjdoaK
>
> (Clang generates a variation of "pop mem," and is horrible code, but
> it's meant for demonstration purposes only.) One interesting thing
> about the use of the builtins is that if at all possible, the "pop"
> instruction may be moved away from the "pushf" if it's safe and would
> reduce register pressure.

I wouldn't trust the compiler to get stack pointer relative accesses
right if it does move them apart.
Definitely scope for horrid bugs ;-)

David
