[PATCH v14 0/9] x86/arch_prctl: Add ARCH_[GET|SET]_CPUID for controlling the CPUID instruction

From: Kyle Huey
Date: Wed Feb 08 2017 - 03:09:51 EST


rr (http://rr-project.org/), a userspace record-and-replay reverse-
execution debugger, would like to trap and emulate the CPUID instruction.
This would allow us to a) mask away certain hardware features that rr does
not support (e.g. RDRAND) and b) enable trace portability across machines
by providing constant results.

Newer Intel CPUs (Ivy Bridge and later) can be configured to fault when
CPUID is executed at CPL > 0. Expose this capability to userspace as a new
pair of arch_prctls, ARCH_GET_CPUID and ARCH_SET_CPUID.

Since v13:
All: rebased on top of tglx's __switch_to_xtra patches
(https://lkml.org/lkml/2016/12/15/432)

Patch 6: x86/arch_prctl: Add ARCH_[GET|SET]_CPUID
- Removed bogus assertion about interrupts

Patch 9: x86/arch_prctl: Rename 'code' argument to 'option'
- New

Three issues were raised last year on the v13 patches. tglx pointed out
that the lock-checking assertion in patch 6 was bogus, and it has been
removed. Ingo asked that we rename the second argument of arch_prctl to
'option'; patch 9 was added to do that.

The third issue raised was about performance and code generation. With the
__switch_to_xtra optimizations I suggested in my response, now implemented
in tglx's patches, the extra burden this feature imposes on context
switches that fall into __switch_to_xtra but do not use CPUID faulting is
a single AND and branch. Compare:

Before:
276: 49 31 dc xor %rbx,%r12
279: 41 f7 c4 00 00 00 02 test $0x2000000,%r12d
280: 75 43 jne 2c5 <__switch_to_xtra+0x95>
282: 41 f7 c4 00 00 01 00 test $0x10000,%r12d
289: 74 17 je 2a2 <__switch_to_xtra+0x72>
28b: 65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax # 293 <__switch_to_xtra+0x63>
292: 00
293: 48 83 f0 04 xor $0x4,%rax
297: 65 48 89 05 00 00 00 mov %rax,%gs:0x0(%rip) # 29f <__switch_to_xtra+0x6f>
29e: 00
29f: 0f 22 e0 mov %rax,%cr4

After:
306: 4c 31 e3 xor %r12,%rbx
309: f7 c3 00 00 00 02 test $0x2000000,%ebx
30f: 0f 85 87 00 00 00 jne 39c <__switch_to_xtra+0xdc>
315: f7 c3 00 00 01 00 test $0x10000,%ebx
31b: 74 17 je 334 <__switch_to_xtra+0x74>
31d: 65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax # 325 <__switch_to_xtra+0x65>
324: 00
325: 48 83 f0 04 xor $0x4,%rax
329: 65 48 89 05 00 00 00 mov %rax,%gs:0x0(%rip) # 331 <__switch_to_xtra+0x71>
330: 00
331: 0f 22 e0 mov %rax,%cr4
334: 80 e7 80 and $0x80,%bh
337: 75 23 jne 35c <__switch_to_xtra+0x9c>

And this is after the optimizations removed three conditional branches from
__switch_to_xtra, so we are still ahead by a net two branches.
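For clarity, the dispatch these listings implement can be modeled in
user-space C: __switch_to_xtra works on the XOR of the previous and next
tasks' thread-info flag words, so a set bit means the flag differs between
the two tasks and the corresponding hardware state must be rewritten. The
bit position below is read off the `and $0x80,%bh` test above (bit 15);
the macro name is illustrative:

```c
#include <stdint.h>

/* CPUID-faulting thread flag; bit 15 matches the `and $0x80,%bh`
 * test in the listing above (illustrative name and position). */
#define TIF_NOCPUID (1u << 15)

/* Model of the __switch_to_xtra dispatch: the caller passes
 * prev->flags XOR next->flags, so a set bit means the two tasks
 * disagree and the hardware state must be updated on this switch. */
static int cpuid_state_needs_update(uint32_t xored_flags)
{
	return (xored_flags & TIF_NOCPUID) != 0;
}
```

When the two tasks agree, the AND clears the bit and the branch falls
through, which is the single-AND-and-branch cost quoted above.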

The generated code for set_cpuid_faulting is:

With CONFIG_PARAVIRT=n, inlined into __switch_to_xtra:
35c: 65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax # 364 <__switch_to_xtra+0xa4>
363: 00
364: 48 83 e0 fe and $0xfffffffffffffffe,%rax
368: b9 40 01 00 00 mov $0x140,%ecx
36d: 48 89 c2 mov %rax,%rdx
370: 4c 89 e0 mov %r12,%rax
373: 48 c1 e8 0f shr $0xf,%rax
377: 83 e0 01 and $0x1,%eax
37a: 48 09 d0 or %rdx,%rax
37d: 48 89 c2 mov %rax,%rdx
380: 65 48 89 05 00 00 00 mov %rax,%gs:0x0(%rip) # 388 <__switch_to_xtra+0xc8>
387: 00
388: 48 c1 ea 20 shr $0x20,%rdx
38c: 0f 30 wrmsr

With CONFIG_PARAVIRT=y:

in __switch_to_xtra:
354: 80 e7 80 and $0x80,%bh
357: 74 0f je 368 <__switch_to_xtra+0x88>
359: 4c 89 e7 mov %r12,%rdi
35c: 48 c1 ef 0f shr $0xf,%rdi
360: 83 e7 01 and $0x1,%edi
363: e8 98 fc ff ff callq 0 <set_cpuid_faulting>

0000000000000000 <set_cpuid_faulting>:
0: e8 00 00 00 00 callq 5 <set_cpuid_faulting+0x5>
5: 55 push %rbp
6: 65 48 8b 15 00 00 00 mov %gs:0x0(%rip),%rdx # e <set_cpuid_faulting+0xe>
d: 00
e: 48 89 d0 mov %rdx,%rax
11: 40 0f b6 d7 movzbl %dil,%edx
15: 48 89 e5 mov %rsp,%rbp
18: 48 83 e0 fe and $0xfffffffffffffffe,%rax
1c: bf 40 01 00 00 mov $0x140,%edi
21: 48 09 c2 or %rax,%rdx
24: 89 d6 mov %edx,%esi
26: 65 48 89 15 00 00 00 mov %rdx,%gs:0x0(%rip) # 2e <set_cpuid_faulting+0x2e>
2d: 00
2e: 48 c1 ea 20 shr $0x20,%rdx
32: ff 14 25 00 00 00 00 callq *0x0
39: 5d pop %rbp
3a: c3 retq
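Both listings are the same read-modify-write of MSR 0x140 (Intel's
MISC_FEATURES_ENABLES register, whose bit 0 enables CPUID faulting), with
the current value cached in %gs-relative per-cpu storage rather than in the
task_struct. As a user-space model of that logic, with the per-cpu cache as
a plain variable and the wrmsr stubbed out:

```c
#include <stdint.h>

#define MSR_MISC_FEATURES_ENABLES 0x140
#define CPUID_FAULT_ENABLE	  0x1ULL	/* bit 0 of the MSR */

/* Stand-in for the per-cpu cached MSR value kept in %gs-relative
 * storage in the disassembly above. */
static uint64_t msr_shadow;

/* Stub: in the kernel this would be the actual wrmsr. */
static void wrmsr_stub(uint32_t msr, uint64_t val)
{
	(void)msr;
	(void)val;
}

/* Model of set_cpuid_faulting(): clear the enable bit in the cached
 * value, OR in the requested state, then write back both the cache
 * and the MSR. */
static void set_cpuid_faulting(int on)
{
	uint64_t val = msr_shadow & ~CPUID_FAULT_ENABLE;

	val |= on ? CPUID_FAULT_ENABLE : 0;
	msr_shadow = val;
	wrmsr_stub(MSR_MISC_FEATURES_ENABLES, val);
}
```

The `and $0xfffffffffffffffe,%rax` in both listings is the clear of bit 0,
and the `mov $0x140,%ecx`/`mov $0x140,%edi` loads the MSR index.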

While these sequences are less efficient than the "load the MSR value from
the task_struct and wrmsr" sequence that Ingo's suggestion would produce,
this is entirely reasonable code and avoids taking up 8 bytes in the
task_struct that will almost never be used.

As I said last year, obviously I want to get this into the kernel or I wouldn't be here.
So if Ingo or others insist on caching the MSR in the task_struct I'll do it, but I still
think this is a good approach.

- Kyle