[tip: x86/entry] x86/entry/64: Sign-extend system calls on entry to int

From: tip-bot2 for H. Peter Anvin (Intel)
Date: Thu May 20 2021 - 09:23:46 EST


The following commit has been merged into the x86/entry branch of tip:

Commit-ID: 0595494891723a1dcca5eaa8eeca8ab54ad953b9
Gitweb: https://git.kernel.org/tip/0595494891723a1dcca5eaa8eeca8ab54ad953b9
Author: H. Peter Anvin (Intel) <hpa@xxxxxxxxx>
AuthorDate: Tue, 18 May 2021 12:13:01 -07:00
Committer: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
CommitterDate: Thu, 20 May 2021 15:19:49 +02:00

x86/entry/64: Sign-extend system calls on entry to int

Right now, *some* code will treat e.g. 0x0000000100000001 as a system
call and some will not. Some of the code, notably in ptrace, will
treat 0x000000018000000 as a system call and some will not. Finally,
right now, e.g. 335 for x86-64 will force the exit code to be set to
-ENOSYS even if poked by ptrace, but 548 will not, because there is an
observable difference between an out of range system call and a system
call number that falls outside the range of the table.

This is visible to the user: for example, the syscall_numbering_64
test fails if run under strace, because as strace uses ptrace, it ends
up clobbering the upper half of the 64-bit system call number.

The architecture independent code all assumes that a system call is "int"
that the value -1 specifically and not just any negative value is used for
a non-system call. This is the case on x86 as well when arch-independent
code is involved. The arch-independent API is defined/documented (but not
*implemented*!) in <asm-generic/syscall.h>.

This is an ABI change, but is in fact a revert to the original x86-64
ABI. The original assembly entry code would zero-extend the system call
number;

Use sign extend to be explicit that this is treated as a signed number
(although in practice it makes no difference, of course) and to avoid
people getting the idea of "optimizing" it, as has happened on at least
two(!) separate occasions.

Do not store the extended value into regs->orig_ax, however: on x86-64, the
ABI is that the callee is responsible for extending parameters, so only
examining the lower 32 bits is fully consistent with any "int" argument to
any system call, e.g. regs->di for write(2). The full value of %rax on
entry to the kernel is thus still available.

[ tglx: Add a comment to the ASM code ]

Signed-off-by: H. Peter Anvin (Intel) <hpa@xxxxxxxxx>
Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Link: https://lore.kernel.org/r/20210518191303.4135296-5-hpa@xxxxxxxxx

---
arch/x86/entry/entry_64.S | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1d9db15..a5f02d0 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -108,7 +108,8 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)

/* IRQs are off. */
movq %rsp, %rdi
- movq %rax, %rsi
+ /* Sign extend the lower 32bit as syscall numbers are treated as int */
+ movslq %eax, %rsi
call do_syscall_64 /* returns with IRQs disabled */

/*