Re: [PATCH 6/7] x86/vdso: Add vDSO functions for user wait instructions

From: Andy Lutomirski
Date: Mon Jul 23 2018 - 22:11:54 EST


On 07/23/2018 05:55 AM, Fenghua Yu wrote:
Users want to query whether the user wait instructions (umonitor, umwait,
and tpause) are supported and then use them. The vDSO functions provide a
fast interface for user space to check for the support and to use the
instructions.

waitpkg_supported and its alias __vdso_waitpkg_supported check whether the
user wait instructions (a.k.a. the wait package feature) are supported.

umonitor and its alias __vdso_umonitor provide user APIs for calling
the umonitor instruction.

umwait and its alias __vdso_umwait provide user APIs for calling
the umwait instruction.

tpause and its alias __vdso_tpause provide user APIs for calling
the tpause instruction.

nsec_to_tsc and its alias __vdso_nsec_to_tsc convert nanoseconds to TSC
counts if the TSC frequency is known. They fail if the TSC frequency is
unknown.

The instructions may be implemented as intrinsic functions in a future
GCC, but the vDSO interfaces are available to user space without
intrinsic support in GCC, and the APIs waitpkg_supported and nsec_to_tsc
cannot be implemented as GCC intrinsics.

Signed-off-by: Fenghua Yu <fenghua.yu@xxxxxxxxx>
---
arch/x86/entry/vdso/Makefile | 2 +-
arch/x86/entry/vdso/vdso.lds.S | 10 ++
arch/x86/entry/vdso/vma.c | 9 ++
arch/x86/entry/vdso/vuserwait.c | 233 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/vdso_funcs_data.h | 3 +
5 files changed, 256 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/entry/vdso/vuserwait.c

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index af4fcae5de83..fb0062b09b3c 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -17,7 +17,7 @@ VDSO32-$(CONFIG_X86_32) := y
VDSO32-$(CONFIG_IA32_EMULATION) := y
# files to link into the vdso
-vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vdirectstore.o
+vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vdirectstore.o vuserwait.o
# files to link into kernel
obj-y += vma.o
diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
index 097cdcda43a5..0942710608bf 100644
--- a/arch/x86/entry/vdso/vdso.lds.S
+++ b/arch/x86/entry/vdso/vdso.lds.S
@@ -35,6 +35,16 @@ VERSION {
__vdso_movdir64b_supported;
movdir64b;
__vdso_movdir64b;
+ waitpkg_supported;
+ __vdso_waitpkg_supported;
+ umonitor;
+ __vdso_umonitor;
+ umwait;
+ __vdso_umwait;
+ tpause;
+ __vdso_tpause;
+ nsec_to_tsc;
+ __vdso_nsec_to_tsc;
local: *;
};
}
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index edbe5e63e5c2..006dfb5e5003 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -372,10 +372,19 @@ static int vgetcpu_online(unsigned int cpu)
static void __init init_vdso_funcs_data(void)
{
+ struct system_counterval_t sys_counterval;
+
if (static_cpu_has(X86_FEATURE_MOVDIRI))
vdso_funcs_data.movdiri_supported = true;
if (static_cpu_has(X86_FEATURE_MOVDIR64B))
vdso_funcs_data.movdir64b_supported = true;
+ if (static_cpu_has(X86_FEATURE_WAITPKG))
+ vdso_funcs_data.waitpkg_supported = true;
+ if (static_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
+ vdso_funcs_data.tsc_known_freq = true;
+ sys_counterval = convert_art_ns_to_tsc(1);
+ vdso_funcs_data.tsc_per_nsec = sys_counterval.cycles;
+ }

You're losing a ton of precision here. You might even be losing *all* of the precision and malfunctioning rather badly.

The correct way to do this is:

tsc_counts = ns * mul >> shift;

and the vclock code illustrates it. convert_art_ns_to_tsc() is a bad example because it uses an expensive division operation for no good reason except that no one bothered to optimize it.
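
Roughly like this -- completely untested, and the ns_to_tsc_mult /
ns_to_tsc_shift fields (both u32) are invented for this sketch, not
something in your patch:

/* Boot side: precompute a mult/shift pair, much like tsc.c does for cyc2ns */
static void __init init_ns_to_tsc_scale(void)
{
	/* cycles = (ns * mult) >> shift; tsc_khz * 1000 is the TSC rate in Hz */
	clocks_calc_mult_shift(&vdso_funcs_data.ns_to_tsc_mult,
			       &vdso_funcs_data.ns_to_tsc_shift,
			       NSEC_PER_SEC, tsc_khz * 1000, 0);
}

/* vDSO side: one multiply and one shift, no division, no rounding to zero */
static notrace u64 vdso_ns_to_tsc(u64 nsec)
{
	const struct vdso_funcs_data *data = _vdso_funcs_data;

	return mul_u64_u32_shr(nsec, data->ns_to_tsc_mult,
			       data->ns_to_tsc_shift);
}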

+notrace int __vdso_nsec_to_tsc(unsigned long nsec, unsigned long *tsc)
+{
+ if (!_vdso_funcs_data->tsc_known_freq)
+ return -ENODEV;
+
+ *tsc = _vdso_funcs_data->tsc_per_nsec * nsec;
+
+ return 0;
+}

Please don't expose this one at all. It would be nice for programs that use waitpkg to be migratable using CRIU-like tools, and this export actively harms any such effort. If you omit this function, then the kernel could learn to abort an in-progress __vdso_umwait if preempted (rseq-style) and CRIU would just work. It would be a bit of a hack, but it solves a real problem.

+notrace int __vdso_umwait(int state, unsigned long nsec)

__vdso_umwait_relative(), please. Because some day (possibly soon) someone will want __vdso_umwait_absolute() and its friend __vdso_read_art_ns() so they can do:

u64 start = __vdso_read_art_ns();
__vdso_umonitor(...);
... do something potentially slow or that might fault ...
__vdso_umwait_absolute(start + timeout);
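
i.e. something like the following, where every function and the error
convention are hypothetical -- nothing below exists in this patch:

static int wait_for_store(volatile unsigned long *addr, unsigned long old,
			  unsigned long timeout_ns)
{
	unsigned long deadline = __vdso_read_art_ns() + timeout_ns;

	while (*addr == old) {
		__vdso_umonitor((void *)addr);
		if (*addr != old)
			break;		/* the store already happened */
		/*
		 * An early or spurious wakeup just loops; there is no
		 * remaining-time arithmetic to redo, the deadline is
		 * still the deadline.
		 */
		if (__vdso_umwait_absolute(deadline))
			return -ETIMEDOUT;
	}
	return 0;
}

With a relative timeout the caller has to re-read the clock and recompute
the remaining time on every wakeup; with an absolute deadline the slow or
faulting middle section simply doesn't matter.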

Also, this patch appears to have a subtle but show-stopping race. Consider:

1. Task A does UMONITOR on CPU 1
2. Task A is preempted.
3. Task B does UMONITOR on CPU 1 at a different address
4. Task A resumes
5. Task A does UMWAIT

Now task A hangs, at least until the next external wakeup happens.
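
To spell out the window (untested sketch; the byte encodings are straight
from the SDM, the helper names are made up):

static inline void umonitor(const void *addr)
{
	/* UMONITOR %rax: arm this *CPU's* (not this task's) monitor hardware */
	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf0" :: "a" (addr));
}

static inline void umwait(unsigned int state, unsigned long long tsc_deadline)
{
	/* UMWAIT %ecx: wait for a store to the armed range or the deadline */
	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf1"
		     :: "c" (state), "a" ((unsigned int)tsc_deadline),
			"d" ((unsigned int)(tsc_deadline >> 32))
		     : "cc");
}

static void wait_on(volatile unsigned long *flag, unsigned long long deadline)
{
	umonitor((const void *)flag);
	/*
	 * Any preemption in this window can leave the CPU armed with
	 * another task's address (steps 2-4 above)...
	 */
	umwait(0, deadline);	/* ...so this may now sleep on the wrong address */
}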

It's not entirely clear to me how you're supposed to fix this without some abomination that's so bad that it torpedoes the entire feature. Except that there is no chicken bit to turn this thing off. Sigh.