Re: Proposal for finishing the 64-bit x86 syscall cleanup

From: Andy Lutomirski
Date: Tue Aug 25 2015 - 12:29:24 EST


On Tue, Aug 25, 2015 at 3:59 AM, Brian Gerst <brgerst@xxxxxxxxx> wrote:
> On Mon, Aug 24, 2015 at 5:13 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>
>> We could also annotate which syscalls need full regs and jump to the
>> slow path for them. This would leave the fast path unchanged (we
>> could duplicate the syscall table so that regs-requiring syscalls
>> would turn into some asm that switches to the slow path). We'd make
>> the syscall table say something like:
>>
>> 59 64 execve sys_execve:regs
>>
>> The fast path would have exactly identical performance and the slow
>> path would presumably speed up. The down side would be additional
>> complexity.
>
> I don't think it is worth it to optimize the syscalls that need full
> pt_regs (which are generally quite expensive and less frequently used)
> at the expense of every other syscall.
>
> What kind of cleanups, other than just removing the stubs, would this
> allow? Is there more code you plan to move to C?

This isn't about optimizing the regs-using syscalls at all -- it's
about simplifying all the other ones and optimizing the slow path.

The way that the regs-using syscalls currently work is that the entry
in the syscall table expects to see rbx, rbp, and r12-r15 in
*registers* and it shoves them into pt_regs and pulls them back out.
This means that we pretty much have to call syscalls from asm, which
precludes the straightforward re-implementation of the whole slow path
as:

void do_slow_syscall(...) {
        enter_from_user_mode();
        fixup_arg5 [if compat fast syscall];
        seccomp, etc;
        if (nr < max)
                call the syscall;
        exit tracing;
        prepare_return_to_usermode();
}

I bet that, with a bit of tweaking, this would actually end up faster
than what we do right now for everything except fully fast-path
syscalls. This would also be a *huge* sanity improvement for the
compat case in which the args are currently jumbled in asm. It would
become:

if (nr < max)
call the syscall(regs->bx, regs->cx, regs->dx, ...);

which completely avoids the unreadable and probably buggy mess we have now.

We could just get rid of the compat fast path entirely -- I would be a
bit surprised if anyone cared about a couple cycles for compat, but I
don't think it's a great idea long-term to have the compat path fully
written in C but the native 64-bit path partially in asm.

My concrete idea here is to have two 64-bit syscall tables: fast and
slow. The slow table would point to the real C functions for all
syscalls. The fast table would be the same except for the syscalls
that use regs; for those syscalls it would point to:

GLOBAL(stub_switch_to_slow_path_64)
        popq %r11                       /* discard return address */
        movq %rbp, RBP(%rsp), etc;
        jmp entry_SYSCALL_64_slow_path
END(stub_switch_to_slow_path_64)

so that the regs-using syscalls take the slow path no matter what.
This doesn't even require autogenerated stubs, since they can all
share the same stub.

Now the 64-bit fast path can stay more or less the same (we'd reorder
the first flags test and the subq $(6*8), %rsp), and the slow path can
be almost all in C.

Then I can back out the two-phase entry tracing thing, and after
*that*, muahaha, I can dust off some languishing seccomp improvements
I have that are incompatible with two-phase entry tracing.

(I have a half-written test case to exercise the dark corners of
syscall args and tracing. So far it catches a bug in SYSCALL32 that
was apparently never fixed (which makes me wonder why signal-heavy
workloads work on AMD systems in compat mode), but I haven't extended
it enough to catch the R9 thing.)

>
>> Thing 2: vdso compilation with binutils that doesn't support .cfi directives
>>
>> Userspace debuggers really like having the vdso properly
>> CFI-annotated, and the 32-bit fast syscall entries are annotated
>> manually in hexadecimal. AFAIK Jan Beulich is the only person who
>> understands it.
>>
>> I want to be able to change the entries a little bit to clean them up
>> (and possibly rework the SYSCALL32 and SYSENTER register tricks, which
>> currently suck), but it's really, really messy right now because of
>> the hex CFI stuff. Could we just drop the CFI annotations if the
>> binutils version is too old or even just require new enough binutils
>> to build 32-bit and compat kernels?
>
> One thing I want to do is rework the 32-bit VDSO into a single image,
> using alternatives to handle the selection of entry method. The
> open-coded CFI crap has made that near impossible to do.
>

Yes please!

But please don't change the actual instruction ordering at all yet,
since the SYSCALL case seems to be buggy right now.

(If you want to be really fancy, don't use alternatives. Instead
teach vdso2c to annotate the actual dynamic table function pointers so
we can rewrite the pointers at boot time. That will save a cycle or
two.)

--Andy




--
Andy Lutomirski
AMA Capital Management, LLC