Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

From: Ingo Molnar
Date: Thu Sep 16 2004 - 01:28:44 EST



* Andi Kleen <ak@xxxxxxx> wrote:

> Known problem. Interrupts don't save regs->rbp, but the new profile_pc
> that was introduced recently uses it.
>
> One quick fix is to just use SAVE_ALL in the interrupt entry code, but
> I don't like this because it will affect interrupt latency.
>
> The real fix is to fix profile_pc to not reference it.

it would be nice if we could profile the callers of the spinlock
functions instead of the opaque spinlock functions.

the ebp trick is nice, but forcing a formal stack frame for every
function has global performance implications. Couldnt we define some
sort of current-> field [or current_thread_info() field] that the
spinlock code could set and clear, which field would be listened to by
profile_pc(), so that the time spent spinning would be attributed to the
callee? Something like:

spin_lock()
current->real_pc = __builtin_return_address(0);
...
current->real_pc = 0;

profile_pc():
...
if (current->real_pc)
pc = current->real_pc;
else
pc = regs->rip;

this basically means that we set up a 'conditional frame pointer' in the
spinlock functions - but only in the spinlock functions! It is true that
this would be 1-2 cycles more expensive than using the frame pointer
register but considering the totality of performance i believe this is
wastly superior.

AFAIR __builtin_return_address(0) is nice and sweet on all platforms,
and it works even with -fomit-frame-pointers. (levels 1 and more are not
guaranteed to work, but level 0 always works.) On x86/x64 gcc derives it
from %esp so the overhead of setting up current->real_pc. On platforms
that have 'current' (or current_thread_info()) in a register, saving and
clearing current->real_pc ought to be a matter of 2-3 instructions.
(could be 2 instructions total on x64 - the same cost as
-fno-omit-frame-pointer ebp saving would have!)

->real_pc would have a small race window from the point it's cleared up
until it returns, but it's not a big issue because 99% of the spinlock
related profile overhead is either in the spinning section or the first
access to the cacheline. If there is some small overhead measured
spin_lock() it's not a big issue, as long as the brunt of the overhead
is attributed to the ->real_pc callee.

spin_unlock() doesnt even need to set up ->real_pc - making this
solution even cheaper.

hm?

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/