Re: Code optimization <LEA Instruction>

From: Richard B. Johnson (root@chaos.analogic.com)
Date: Fri Jan 28 2000 - 06:54:59 EST


On Thu, 27 Jan 2000, Michael Loftis wrote:

> I'd like to look at this 'test.'
>
> I've proven many times that documentation != implementation. I mean no
> disrespect for Mr. Cox, but often times what my code does and what my
> documentation explains don't exactly line up, and this is true of many
> programmers and chip engineers.
>
> IAE I have *not* been following this thread and apologise if my comment
> is out in left field.

The test shows that LEA should not be arbitrarily used for arithmetic on
an index register. It clearly shows that addition is faster.

It was written to demonstrate this. It demonstrates this. Alan assumed
that I didn't know what I was doing and declared the test invalid, while
it is so provably valid that that such an argument is absurd.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>

int main(void);
void dummy(int unused);

volatile int true;

void dummy(int unused)
{
  true = 0;
}

int main()
{
    long leal = 0;
    long addl = 0;
        printf("Testing...");
        fflush(stdout);
        (void)signal(SIGALRM, dummy);
        (void)alarm(1);
        true++;
        while(true)
        {
            __asm__ __volatile__(
            ".align 8\n"
            "\tleal 2(%eax), %eax\n" /* 10 leals */
            "\tleal 2(%eax), %eax\n"
            "\tleal 2(%eax), %eax\n"
            "\tleal 2(%eax), %eax\n"
            "\tleal 2(%eax), %eax\n"
            "\tleal 2(%eax), %eax\n"
            "\tleal 2(%eax), %eax\n"
            "\tleal 2(%eax), %eax\n"
            "\tleal 2(%eax), %eax\n"
            "\tleal 2(%eax), %eax\n"
                       );
            leal++;
        }
        (void)signal(SIGALRM, dummy);
        (void)alarm(1);
        true++;
        while(true)
        {
            __asm__ __volatile__(
            ".align 8\n"
            "\taddl $2, %eax\n" /* 10 adds */
            "\taddl $2, %eax\n"
            "\taddl $2, %eax\n"
            "\taddl $2, %eax\n"
            "\taddl $2, %eax\n"
            "\taddl $2, %eax\n"
            "\taddl $2, %eax\n"
            "\taddl $2, %eax\n"
            "\taddl $2, %eax\n"
            "\taddl $2, %eax\n"
                    );
            addl++;
        }
        printf("\nNr adds = %ld Nr leals = %ld\n", addl, leal);
        return 0;
}

Now, LEA generates an address. Address generation takes more time
than simple register operations. If you use LEA for math on a register,
and immediately use the result, your code stalls during this time.

This code was deliberately designed to show that. If, as is done in
the checksum_partial() routines, there are other things that can
be done while the address generation is occurring (in parallel), then
such hand-tweaked code can execute faster than using math on ths index
register.

Unfortunately, I see that much of the indexed address generation
done in new versions of the C compiler abitrarily use LEA, followed
immediately by the memory fetch using the newly generated address.
This will be slower than simple math to obtain the address.

Cheers,
Dick Johnson

Penguin : Linux version 2.3.39 on an i686 machine (800.63 BogoMips).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Mon Jan 31 2000 - 21:00:20 EST