Re: [patch 2/3] mutex subsystem: fastpath inlining

From: Nicolas Pitre
Date: Tue Dec 27 2005 - 16:58:11 EST


On Tue, 27 Dec 2005, Ingo Molnar wrote:

>
> * Nicolas Pitre <nico@xxxxxxx> wrote:
>
> > Some architectures, notably ARM for instance, might benefit from
> > inlining the mutex fast paths. [...]
>
> what is the effect on text size? Could you post the before- and
> after-patch vmlinux 'size kernel/test.o' output in the nondebug case,
> with Arjan's latest 'convert a couple of semaphore users to mutexes'
> patch applied? [make sure you've got enough of those users compiled in,
> so that the inlining cost is truly measured. Perhaps also do
> before/after 'size' output of a few affected .o files, without mixing
> kernel/mutex.o into it, like vmlinux does.]

Theory should be convincing enough. First of all, all the semaphore
fast paths are always inlined currently, on all architectures I've
looked at. A down() fast path is always looking like this:

mrs ip, cpsr
orr lr, ip, #128
msr cpsr_c, lr
ldr lr, [r0]
subs lr, lr, #1
str lr, [r0]
msr cpsr_c, ip
movmi ip, r0
blmi __down_failed

So our starting point for comparison is 9 instructions for every down()
occurence in the kernel. Same thing for up(). Every instruction is
invariably 4 bytes.

Now let's look at the typical mutex_lock():

mov r4, #0
swp r3, r4, [r0]
cmp r3, #1
blne __mutex_lock_noinline

This is 4 instructions. Further more, the first "mov r4, #0" can often
be eliminated when gcc can cse the constant 0 from another
register. We're talking about 3 instructions then, down from 9 !

We therefore saves between 20 and 24 bytes of kernel .text for every
down() and every up() simply going with mutexes.

Now if the mutex_lock and mutex_unlock were not inlined, the above 3 or
4 instructions would become one or two per call site, which is still a
gain in space, however not as important as the one provided by the move
from semaphores to mutexes. It however would be more costly in terms of
cycles since a function prologue and epilogue is somewhat costly on ARM,
especially with frame pointer enabled (I'll let RMK elaborate on his
reasons for not disabling them).

And for mutex_lock_interruptible(), the inlined fastpath is not bigger
than the non-inlined one, considering that the return value has to be
tested (the test is done twice in the non-inlined case: once inside the
function, and once outside of it) while the inlined version needs only
one test. They are therefore equivalent in terms of space.


Nicolas
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/