Re: [PATCH 4/5] x86/asm/entry/32: Replace RESTORE_RSI_RDI[_RDX] with open-coded 32-bit reads

From: Denys Vlasenko
Date: Thu Jun 18 2015 - 07:00:32 EST


On 06/18/2015 11:31 AM, Ingo Molnar wrote:
>> If it takes, say, 20 cycles to pull data from e.g. L3 cache to ECX,
>> then SYSRET can't possibly complete sooner than in 20 cycles.
>
> Yeah, that's true, but my point is: SYSRET has to do a lot of other things
> (permission checks, loading the user mode state - most of which are unrelated to
> R11/RCX), which take dozens of cycles,

SYSRET was designed to avoid doing that. It does not check permissions
- it slam-dunks CPL3 and resets CS and SS to preset values.
It does not touch the stack register or restore any other GP register.

Having said that, I'd try to get cold hard facts, i.e. experimentally
measure SYSRET latency.


> and which are probably overlapped with any
> cache misses on arguments such as R11/RCX.
>
> It's not impossible that reordering helps, for example if SYSRET has some internal
> dependencies that makes it parallelism worse than ideal - but I'd complicate this
> code only if it gives a measurable improvement for cache-cold syscall performance.

I attempted to test it. With the patch which moves RCX and R11 loads all the way down:

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index f2064bd..0ea09a3 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -139,9 +139,6 @@ sysexit_from_sys_call:
* with 'sysenter' and it uses the SYSENTER calling convention.
*/
andl $~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
- /* Prepare registers for SYSRET insn */
- movl RIP(%rsp), %ecx /* User %eip */
- movl EFLAGS(%rsp), %r11d /* User eflags */
/* Restore registers per SYSEXIT ABI requirements: */
/* arg1 (ebx): preserved by virtue of being a callee-saved register */
/* arg2 (ecx): used by SYSEXIT to restore esp (and by SYSRET to restore eip) */
@@ -155,6 +152,9 @@ sysexit_from_sys_call:
xorl %r8d, %r8d
xorl %r9d, %r9d
xorl %r10d, %r10d
+ /* Prepare registers for SYSRET insn */
+ movl RIP(%rsp), %ecx /* User %eip */
+ movl EFLAGS(%rsp), %r11d /* User eflags */
TRACE_IRQS_ON

/*
@@ -374,9 +374,6 @@ cstar_dispatch:

sysretl_from_sys_call:
andl $~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
- /* Prepare registers for SYSRET insn */
- movl RIP(%rsp), %ecx /* User %eip */
- movl EFLAGS(%rsp), %r11d /* User eflags */
/* Restore registers per SYSRET ABI requirements: */
/* arg1 (ebx): preserved by virtue of being a callee-saved register */
/* arg2 (ebp): preserved (already restored, see above) */
@@ -388,6 +385,9 @@ sysretl_from_sys_call:
xorl %r8d, %r8d
xorl %r9d, %r9d
xorl %r10d, %r10d
+ /* Prepare registers for SYSRET insn */
+ movl RIP(%rsp), %ecx /* User %eip */
+ movl EFLAGS(%rsp), %r11d /* User eflags */
TRACE_IRQS_ON
movl RSP(%rsp), %esp
/*

This does not change instruction sizes, and therefore does not change
code cacheline alignment anywhere in the bzImage.


Testing getpid() in a loop (IOW: a cache-hot test) did show that with
this patch it is slower, but by a statistically insignificant amount:

before patch, it's 61.92 ns per syscall.
after patch, it's 61.99 ns per syscall.

That difference is less than one cycle, more like 0.15 cycles.
However, it is reproducible.

I did not figure out how to do a proper cache-cold test.
I tried a 65kbyte-ish read from "/dev/zero". That takes ~3885 ns,
and its +-10 ns variability drowns out any possible difference.
