[PATCH 4/6] Unsuck "x86/entry/64: Create a percpu SYSCALL entry trampoline"

From: Andy Lutomirski
Date: Fri Dec 01 2017 - 01:30:27 EST


This fixes a huge performance regression caused by the pushq/retq
jump in the percpu SYSCALL entry trampoline.

Please add to the changelog:

This patch actually seems to be a small speedup. With this patch,
SYSCALL touches an extra cache line and an extra virtual page, but
the pipeline no longer stalls waiting for SWAPGS. It seems that, at
least in a tight loop, the latter outweighs the former.

Thanks to David Laight for an optimization tip.

[end addition to changelog]

Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
---
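For anyone who wants to reproduce the "tight loop" comparison mentioned
in the changelog, here is a minimal userspace sketch. It is not the
benchmark used for the numbers above; the helper name ns_per_syscall()
and the choice of getppid() as a cheap syscall are assumptions made for
illustration only.

```c
#define _GNU_SOURCE		/* for the syscall() declaration */
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of this patch): return the average
 * wall-clock time, in nanoseconds, of one getppid() syscall issued
 * back to back in a tight loop.
 */
double ns_per_syscall(long iterations)
{
	struct timespec start, end;
	double elapsed_ns;
	long i;

	assert(iterations > 0);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iterations; i++)
		syscall(SYS_getppid);	/* enters the kernel on every call */
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
		     (end.tv_nsec - start.tv_nsec);
	return elapsed_ns / iterations;
}
```

Calling ns_per_syscall() with a few million iterations from main(), on
kernels with and without this patch and on an otherwise idle CPU, should
give a rough per-syscall figure. The result includes loop and libc
wrapper overhead, so only the difference between kernels is meaningful.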
arch/x86/entry/entry_64.S | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index caf74a1bb3de..28f4e7553c26 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -180,14 +180,24 @@ ENTRY(entry_SYSCALL_64_trampoline)

 	/*
 	 * x86 lacks a near absolute jump, and we can't jump to the real
-	 * entry text with a relative jump, so we fake it using retq.
+	 * entry text with a relative jump. We could push the target
+	 * address and then use retq, but this destroys the pipeline on
+	 * many CPUs (wasting over 20 cycles on Sandy Bridge). Instead,
+	 * spill RDI and restore it in a second-stage trampoline.
 	 */
-	pushq $entry_SYSCALL_64_after_hwframe
-	retq
+	pushq %rdi
+	movq $entry_SYSCALL_64_stage2, %rdi
+	jmp *%rdi
 END(entry_SYSCALL_64_trampoline)

 	.popsection

+ENTRY(entry_SYSCALL_64_stage2)
+	UNWIND_HINT_EMPTY
+	popq %rdi
+	jmp entry_SYSCALL_64_after_hwframe
+END(entry_SYSCALL_64_stage2)
+
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
--
2.13.6