Re: a.out binaries that are 66% faster than ELF, problem found?

Bernd Schmidt (crux@Pool.Informatik.RWTH-Aachen.DE)
Mon, 3 Mar 1997 14:12:03 +0100 (MET)


>
> ------------------------------------------------------------>
> gcc-2.5.8 (fast)              gcc-2.7.2.1 (slow)
>
> RC5_KEY_CHECK:                RC5_KEY_CHECK:
>     subl $12,%esp                 subl $304,%esp
> <-----------------------------------------------------------
>
> Look at the stack size difference. Now RC5_CHECK is different, there are
> no 'lost stack slots', but lots of spilled registers:
>
> 08048e85 <RC5_KEY_CHECK+2f5> roll $0x3,%eax
> 08048e88 <RC5_KEY_CHECK+2f8> movl %eax,0x10c(%esp,1)
> 08048e8f <RC5_KEY_CHECK+2ff> movl %eax,0x804c210
> 08048e94 <RC5_KEY_CHECK+304> addl 0x10c(%esp,1),%edx

Okay, over the weekend I looked at this problem. The code above is silly, and
gcc 2.5.8 does a lot better (i.e., the last instruction would have been
"addl %eax,%edx"). What happens is this.

The roll comes from an asm pattern that sets a pseudo register (let's call
it reg1). reg1 has a very long lifetime, and GCC can't allocate a hard
register for it, so it's spilled to 0x10c(%esp) (second instruction). The
third instruction is fine; it comes from the C code. In the last instruction,
reg1 is referenced again. The compiler knows it has been spilled and now
lives in memory, but it still scans the preceding insns to see whether the
value can still be found in a register (function find_equiv_reg in reload.c).
In GCC 2.5.8, find_equiv_reg finds the equivalence made by instruction 2 and
uses eax instead of the stack slot. GCC 2.7.2.1 has a bugfix applied in
find_equiv_reg which makes it think that the store to memory in instruction 3
can alias with the stack slot for reg1. This is very conservative, and in
this case it's plain that a store to a global variable can't alias with the
stack slot for reg1.
This would be rather easy to fix, but it would still do the wrong thing if
instruction 3 stored to a different (non-aliasing) stack slot.
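
To make the aliasing argument concrete, here is a toy model in C. It is only
a sketch: the types and function names (mem_ref, may_alias_conservative,
may_alias_refined, reg_copy_usable) are made up for illustration and are not
what reload.c actually uses. It assumes a spill slot's address is never taken,
so a store to a global cannot touch it, and that two accesses of the same kind
overlap only if their byte ranges intersect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types and names are hypothetical and are not
   GCC's internal representation. */
enum mem_kind { MEM_STACK_SLOT, MEM_GLOBAL };

struct mem_ref {
    enum mem_kind kind;
    long addr;   /* offset from %esp for stack slots, absolute address for globals */
    int size;    /* width of the access in bytes */
};

typedef bool (*alias_fn)(struct mem_ref, struct mem_ref);

/* Conservative rule: every store may clobber every memory location. */
static bool may_alias_conservative(struct mem_ref a, struct mem_ref b)
{
    (void)a;
    (void)b;
    return true;
}

/* Refined rule: a spill slot and a global never overlap; two references
   of the same kind overlap only if their byte ranges intersect. */
static bool may_alias_refined(struct mem_ref a, struct mem_ref b)
{
    if (a.kind != b.kind)
        return false;
    return a.addr < b.addr + b.size && b.addr < a.addr + a.size;
}

/* After "reg -> slot", is the copy in reg still known to equal the slot
   once the given intervening stores have executed?  Assumes none of the
   intervening insns overwrites the register itself. */
static bool reg_copy_usable(struct mem_ref slot,
                            const struct mem_ref *stores, size_t n,
                            alias_fn may_alias)
{
    assert(stores != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (may_alias(slot, stores[i]))
            return false;   /* slot may have changed, reload from memory */
    return true;            /* register copy can be used directly */
}
```

Plugging in the values from the disassembly (slot 0x10c(%esp), store to
0x804c210), the conservative rule rejects the register copy, which matches
the reload from the stack in the 2.7.2.1 output, while the refined rule
accepts it, which matches "addl %eax,%edx". A store to a neighbouring slot
such as 0x110(%esp) is also accepted by the refined rule, which is the case a
simple "globals don't alias the stack" fix would still miss.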

(btw, I still can't explain the stack size difference. Maybe GCC 2.5.8 does
CSE less aggressively than 2.7.2.1, so that the pseudo registers have shorter
lifetimes and can share stack slots.)

Bernd