Re: nasm over gas?

From: Jamie Lokier
Date: Mon Sep 08 2003 - 12:00:27 EST



Ihar 'Philips' Filipau wrote:
> Jamie Lokier wrote:
> >Ihar 'Philips' Filipau wrote:
> >
> >> It will depend on arch CPU only in case if you have unlimited i$ size.
> >> Servers with 8MB of cache - yes it is faster.
> >> Celeron with 128k of cache - +4bytes == higher probability of i$ miss
> >>== lower performance.
> >
> >Higher probability != optimal performance.
> >
> >It depends on your execution context. If it's part of a tight loop
> >which is executed often, then saving a cycle in the loop gains more
> >performance than saving icache, even on a 128k Celeron.
> >
>
> You think as system-programmer.
> Every bit of i$ waste - hit user space applications too.
> 128k of $ - is for every app.
>
> If you gained one cycle by polluting one more cache line - do not
> forget that this cache line probably contained some info, which was able
> to avoid cache miss for another application. So you gained cycle here -
> and lost it immediately in another app. Not good.

Usually the whole L1 cache is flushed between application context
switches anyway, so the cost of a miss is borne by the application
which causes it.

And that _still_ doesn't change the truth of my statement. One cycle
saved in a loop which executes 1000 times is worth more than an L1
i-cache miss, always.

> If you can improve performance by NOT polluting cache - it would be
> another story :-)))

Yes, of course that is better when it is possible.

Modern OOO CPUs are subtle beasts. Like I said, I added a single
"nop" (one-byte instruction) to a tight graphics loop once, and the
loop went significantly faster. I could not explain it, except that I
know the Pentium Pro instruction decode stage has many quirks.

> But still I never saw this kind of thing being used in kernel.
> Instead of writing normal asm we have something like i386/mmx.c. And
> i386/checksum.S not the best sample of asm in kernel too. Sad.

Those were written before GAS had a macro facility. I agree with you,
the macro facility should be used more in the kernel.

The two examples you gave have been carefully tuned on particular
CPUs, by trial and error. Changing the instruction order makes a big
difference to their performance.

-- Jamie