[PATCH 3/3] x86, pkeys: allow instruction fetches in presence of pkeys

From: Dave Hansen
Date: Fri Jul 22 2016 - 14:03:44 EST



From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>

Thanks to Andy Lutomirski for pointing out the potential issue
here.

Memory protection keys only affect data access. They do not
affect instruction fetches. So, an instruction may not be
readable, while it *is* executable.
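
For illustration, a little userspace program along these lines
can set up memory that executes fine but faults on a data read.
This is an untested sketch: it assumes the pkey_alloc() and
pkey_mprotect() wrappers for the syscalls added earlier in this
series, needs pkeys-capable hardware, and skips all error
checking:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		unsigned char ret_insn = 0xc3;	/* x86 "ret" */
		void *page;
		int pkey;

		page = mmap(NULL, 4096, PROT_READ|PROT_WRITE|PROT_EXEC,
			    MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
		memcpy(page, &ret_insn, 1);

		/* deny *data* access via the key; ifetch is unaffected */
		pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);
		pkey_mprotect(page, 4096,
			      PROT_READ|PROT_WRITE|PROT_EXEC, pkey);

		((void (*)(void))page)();	/* OK: instruction fetch */
		printf("%x\n", *(unsigned char *)page); /* faults: data read */
		return 0;
	}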

The prefetch-checking code in the page fault path reads
instruction bytes directly and might trip over an instruction
made unreadable by pkeys. Turn pkeys off temporarily so that
we can fetch the instruction.

===== Long explanation: =====

We have a long history of bugs with prefetch instructions faulting
when they should not. So, in the slow paths of our page fault
code, we go peeking at the instructions being run when the fault
occurred to see if the fault could have been caused by a prefetch
instruction.
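
The check itself boils down to matching a couple of two-byte
opcodes. A stripped-down sketch of the test (the real
check_prefetch_opcode() also has to deal with instruction
prefixes, segment overrides, and some 32-bit-only quirks):

	/*
	 * PREFETCH (3DNow!) is 0x0F 0x0D, and the PREFETCHh
	 * family (NTA/T0/T1/T2) is 0x0F 0x18.
	 */
	static int insn_is_prefetch(unsigned char byte0,
				    unsigned char byte1)
	{
		if (byte0 != 0x0f)
			return 0;
		return byte1 == 0x0d || byte1 == 0x18;
	}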

To do the peeking, the kernel looks at the instruction pointer
and fetches the instruction, sometimes right out of userspace.
But, protection keys may get in the way and make the kernel's
data access fault. The kernel will then assume that the
instruction pointer was bogus and go angrily along on its way
to killing the app, instead of _actually_ checking whether the
fault came from a prefetch instruction.

Does this matter? Probably not. The hardware implementing
protection keys is not even publicly available yet. But, if it
or some future processor has a prefetch bug, this will help us
properly work around it.

This makes the instruction temporarily readable so that we can do
a data access and peek at its opcode. This operation is only
visible to the thread doing the fault.
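
That works because PKRU is per-thread register state, saved and
restored along with the rest of the XSAVE state at context
switch. An untested sketch demonstrating that (again, it needs
OSPKE hardware; WRPKRU will #UD anywhere else; the rdpkru() and
wrpkru() helpers just open-code the instruction bytes the same
way the kernel's __read_pkru()/__write_pkru() do):

	#include <pthread.h>
	#include <semaphore.h>
	#include <stdio.h>

	static unsigned int rdpkru(void)
	{
		unsigned int pkru, edx;

		/* rdpkru: reads PKRU into eax, zeroes edx, needs ecx=0 */
		asm volatile(".byte 0x0f,0x01,0xee"
			     : "=a" (pkru), "=d" (edx) : "c" (0));
		return pkru;
	}

	static void wrpkru(unsigned int pkru)
	{
		/* wrpkru: writes eax into PKRU, needs ecx=0 and edx=0 */
		asm volatile(".byte 0x0f,0x01,0xef"
			     : : "a" (pkru), "c" (0), "d" (0));
	}

	static sem_t wrote;

	static void *other_thread(void *arg)
	{
		(void)arg;
		sem_wait(&wrote);
		/* prints this thread's own copy, untouched by main() */
		printf("other thread PKRU: %08x\n", rdpkru());
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		sem_init(&wrote, 0, 0);
		pthread_create(&t, NULL, other_thread, NULL);

		wrpkru(0xc);	/* access+write-disable for pkey 1 */
		printf("main thread PKRU:  %08x\n", rdpkru());

		sem_post(&wrote);
		pthread_join(t, NULL);
		return 0;
	}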

I thought this might be painful to solve, requiring something
akin to a kernel_fpu_begin/end() pair. After thinking about it,
I managed to convince myself that we do not need that harsh of
a solution and this simple one is OK, mostly because we never
do lazy save/restore of PKRU.

This is slow-ish. The RDPKRU/WRPKRU/WRPKRU sequence probably
costs us dozens of cycles, plus the extra branches from the
if (OSPKE) checks. It also pays that cost once per instruction
byte that it fetches, which is a bit naughty. But, this is a
slow path already, so I haven't tried to optimize it at all.
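
If it ever does show up on a profile, an obvious first step
would be to toggle PKRU once around the whole copy instead of
once per byte, along these lines (an untested sketch using the
existing probe_kernel_read(), with a hypothetical
probe_insn_bytes() name):

	static int probe_insn_bytes(void *insn_address, void *buf,
				    size_t len)
	{
		u32 saved_pkru = read_pkru();
		long ret;

		if (saved_pkru)
			write_pkru(0);
		ret = probe_kernel_read(buf, insn_address, len);
		if (saved_pkru)
			write_pkru(saved_pkru);

		return ret ? -EFAULT : 0;
	}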

Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Andy Lutomirski <luto@xxxxxxxxxx>
---

b/arch/x86/mm/fault.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 49 insertions(+), 2 deletions(-)

diff -puN arch/x86/mm/fault.c~pkeys-900-fix-prefetch arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-900-fix-prefetch 2016-07-22 10:52:33.274364481 -0700
+++ b/arch/x86/mm/fault.c 2016-07-22 10:52:33.277364616 -0700
@@ -20,6 +20,7 @@
#include <asm/pgalloc.h> /* pgd_*(), ... */
#include <asm/kmemcheck.h> /* kmemcheck_*(), ... */
#include <asm/fixmap.h> /* VSYSCALL_ADDR */
+#include <asm/fpu/internal.h> /* for use_eager_fpu() checks */
#include <asm/vsyscall.h> /* emulate_vsyscall */
#include <asm/vm86.h> /* struct vm86 */
#include <asm/mmu_context.h> /* vma_pkey() */
@@ -76,6 +77,52 @@ static nokprobe_inline int kprobes_fault
}

/*
+ * Memory protection keys only affect data access. They do not
+ * affect instruction fetches. So, an instruction may not be
+ * readable, while it *is* executable. This makes the
+ * instruction temporarily readable so that we can do a data
+ * access and peek at its opcode.
+ */
+static
+int probe_insn_opcode(void *insn_address, unsigned char *ret_opcode)
+{
+	int ret;
+	u32 saved_pkru = read_pkru();
+
+	/*
+	 * Clear out all of the access/write-disable bits in
+	 * PKRU. This ensures that pkeys will not block access
+	 * to @insn_address. If no keys are access-disabled
+	 * (saved_pkru == 0), avoid the cost of the PKRU writes
+	 * and the continued cost of having taken it out of its
+	 * (XSAVE) init state.
+	 *
+	 * Note also that this only affects access to user
+	 * addresses. Kernel (supervisor) mappings are not
+	 * affected by this register.
+	 */
+	if (saved_pkru)
+		write_pkru(0);
+	/*
+	 * We normally have to be very careful with FPU registers
+	 * and preempt. But, PKRU is different. It is never
+	 * lazily saved/restored, so we don't have to be as
+	 * careful with this as normal FPU state. Enforce this
+	 * assumption with the WARN_ON().
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_OSPKE))
+		WARN_ON_ONCE(!use_eager_fpu());
+	ret = probe_kernel_address(insn_address, *ret_opcode);
+	/*
+	 * Restore PKRU to the value it held on entry.
+	 */
+	if (saved_pkru)
+		write_pkru(saved_pkru);
+
+	return ret;
+}
+
+/*
* Prefetch quirks:
*
* 32-bit mode:
@@ -126,7 +173,7 @@ check_prefetch_opcode(struct pt_regs *re
		return !instr_lo || (instr_lo>>1) == 1;
	case 0x00:
		/* Prefetch instruction is 0x0F0D or 0x0F18 */
-		if (probe_kernel_address(instr, opcode))
+		if (probe_insn_opcode(instr, &opcode))
			return 0;

		*prefetch = (instr_lo == 0xF) &&
@@ -160,7 +207,7 @@ is_prefetch(struct pt_regs *regs, unsign
	while (instr < max_instr) {
		unsigned char opcode;

-		if (probe_kernel_address(instr, opcode))
+		if (probe_insn_opcode(instr, &opcode))
			break;

		instr++;
_