Re: [RFC][PATCH] Faster generic_fls

From: Willy TARREAU (willy@w.ods.org)
Date: Thu May 01 2003 - 08:52:04 EST

Next message: Steven Whitehouse: "Re: Remains of seq_file conversion for DECnet, plus fixes"
Previous message: Dirk Eddelbuettel: "Tyan 2466 SMP locks hard with 2.4.20 + heavy disk i/o yet runs 2.2.* without problems"
In reply to: hugang: "Re: [RFC][PATCH] Faster generic_fls"
Next in thread: Falk Hueffner: "Re: [RFC][PATCH] Faster generic_fls"
Reply: Falk Hueffner: "Re: [RFC][PATCH] Faster generic_fls"
Reply: hugang: "Re: [RFC][PATCH] Faster generic_fls"
Reply: Lincoln Dale: "Re: [RFC][PATCH] Faster generic_fls"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, May 01, 2003 at 03:05:57PM +0800, hugang wrote:
> Hello
>
> Here is the table version of fls. Before version of it is wrong.

Ok, I recoded the tree myself with if/else, and it's now faster than all others,
whatever the compiler. I have cut the tree to only 16 bits evaluations because
it was still faster than a full 32 bits tree on x86, probably because of the
code length. However, on Alpha EV6 (gcc-2.95.3), the 32 bits tree is faster.
The 32 bits code is also the same speed as the 16 bits on gcc-3.2.3 if I
specify -mbranch-cost=2 (because gcc assumes that today's CPUs need one cycle
per jump).

Here is the function, followed by the logs executed on an Athlon.

Disclaimer: Daniel's PentiumIII seems to give very different results than my
Athlon, so every test should be considered for one arch only !!!

To quickly resume the results, on gcc-3.2.3 it's 45% faster than the table algo,
I don't understand why, and just a little bit faster than Daniel's function.

On gcc-2.95.3, it's only 20% faster than the table, but 46% than Daniel's.

However, I have good news for you, the table code is 15% faster than my tree
on an Alpha EV6 ;-)
It may be because gcc can use different tricks on this arch.

Cheers
Willy

static inline int fls_tree_fls(unsigned n) {
#ifndef REALLY_WANT_A_32BITS_TREE
    /* it seems as if working on half a tree is faster than the
       complete tree.
    */
    register t = 0;
    if (n >= (1<<16)) {
        n >>= 16;
        t = 16;
    }
    if (n < (1<<8))
        if (n < (1<<4))
            if (n < 4)
                return t + n - ((n + 1) >> 2);
            else
                return t + 3 + (n >= 8);
        else
            if (n < (1<<6))
                return t + 5 + (n >= (1<<5));
            else
                return t + 7 + (n >= (1<<7));
    else
        if (n < (1<<12))
            if (n < (1<<10))
                return t + 9 + (n >= (1<<9));
            else
                return t + 11 + (n >= (1<<11));
        else
            if (n < (1<<14))
                return t + 13 + (n >= (1<<13));
            else
                return t + 15 + (n >= (1<<15));
#else
    /* perhaps this is faster on systems with BIG caches, but on
       athlon, at least, it's slower than the previous one
    */
    if (n < (1<<16))
        if (n < (1<<8))
            if (n < (1<<4))
                if (n < 4)
                    return n - ((n + 1) >> 2);
                else
                    return 3 + (n >= 8);
            else
                if (n < (1<<6))
                    return 5 + (n >= (1<<5));
                else
                    return 7 + (n >= (1<<7));
        else
            if (n < (1<<12))
                if (n < (1<<10))
                    return 9 + (n >= (1<<9));
                else
                    return 11 + (n >= (1<<11));
            else
                if (n < (1<<14))
                    return 13 + (n >= (1<<13));
                else
                    return 15 + (n >= (1<<15));
    else
        if (n < (1<<24))
            if (n < (1<<20))
                if (n < (1<<18))
                    return 17 + (n >= (1<<17));
                else
                    return 19 + (n >= (1<<19));
            else
                if (n < (1<<22))
                    return 21 + (n >= (1<<21));
                else
                    return 23 + (n >= (1<<23));
        else
            if (n < (1<<28))
                if (n < (1<<26))
                    return 25 + (n >= (1<<25));
                else
                    return 27 + (n >= (1<<27));
            else
                if (n < (1<<30))
                    return 29 + (n >= (1<<29));
                else
                    return 31 + (n >= (1<<31));
#endif
}

== results on Athlon 1800+ (1.53 GHz), gcc-2.95.3 -O3 -march=i686 :

4294967295 iterations of nil... checksum = 4294967295

real 0m5.600s
user 0m5.590s
sys 0m0.007s
4294967295 iterations of old... checksum = 4294967265

real 0m39.640s
user 0m39.499s
sys 0m0.141s
4294967295 iterations of new... checksum = 4294967265

real 0m45.088s
user 0m44.936s
sys 0m0.158s
4294967295 iterations of fls_table... checksum = 4294967264

real 0m35.988s
user 0m35.833s
sys 0m0.159s
4294967295 iterations of fls_tree... checksum = 4294967265

real 0m28.699s
user 0m28.605s
sys 0m0.096s

== results on Athlon 1800+ (1.53 GHz), gcc-3.2.3 -O3 -march=athlon :

4294967295 iterations of nil... checksum = 4294967295

real 0m16.624s
user 0m16.624s
sys 0m0.001s
4294967295 iterations of old... checksum = 4294967265

real 0m57.685s
user 0m57.568s
sys 0m0.125s
4294967295 iterations of new... checksum = 4294967265

real 0m37.513s
user 0m37.415s
sys 0m0.103s
4294967295 iterations of fls_table... checksum = 4294967264

real 0m56.068s
user 0m55.991s
sys 0m0.085s
4294967295 iterations of fls_tree...
checksum = 4294967265

real 0m36.636s
user 0m36.516s
sys 0m0.125s

=== alpha EV6 / 466 Mhz, gcc-2.95.3 -O3 ====
4294967295 iterations of new... checksum = 4294967265

real 2m19.951s
user 2m19.543s
sys 0m0.018s
4294967295 iterations of fls_table... checksum = 4294967265

real 1m12.825s
user 1m12.667s
sys 0m0.005s
4294967295 iterations of fls_tree... checksum = 4294967265

real 1m33.469s
user 1m33.242s
sys 0m0.013s

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Steven Whitehouse: "Re: Remains of seq_file conversion for DECnet, plus fixes"
Previous message: Dirk Eddelbuettel: "Tyan 2466 SMP locks hard with 2.4.20 + heavy disk i/o yet runs 2.2.* without problems"
In reply to: hugang: "Re: [RFC][PATCH] Faster generic_fls"
Next in thread: Falk Hueffner: "Re: [RFC][PATCH] Faster generic_fls"
Reply: Falk Hueffner: "Re: [RFC][PATCH] Faster generic_fls"
Reply: hugang: "Re: [RFC][PATCH] Faster generic_fls"
Reply: Lincoln Dale: "Re: [RFC][PATCH] Faster generic_fls"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Wed May 07 2003 - 22:00:13 EST