Re: SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weirdcrap with vdso on uml/i386)

From: Andrew Lutomirski
Date: Sun Aug 21 2011 - 22:02:54 EST


On Sun, Aug 21, 2011 at 9:48 PM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
> On 08/21/2011 06:41 PM, Linus Torvalds wrote:
>> If people are using syscall directly, we're pretty much stuck. No
>> amount of "that's hopelessly wrong" will ever matter. We don't break
>> existing binaries.
>>
>> That said, I'd *hope* that everybody uses the vdso32, simply because
>> user programs are not supposed to know which CPU they are running on
>> and if that CPU even *supports* the syscall instruction. In which case
>> it may be possible that we can play games with the vdso thing. But
>> that really would be conditional on "nobody ever reports a failure".
>
> I think we found that out with the vsyscall emulation issue last cycle.
>  It works, so it will have been used, somewhere...
>
>> But if that's possible, maybe we can increment the RIP by 2 for
>> 'syscall', and slip an "int 0x80" after the syscall instruction in
>> the vdso there? Resulting in the same pseudo-solution I suggested for
>> sysenter...
>
> I think we have the above problem.
>
> The problem here is that the syscall state is actually more complex than
> we retain: the entire state is given by (entry point, register state);
> with that amount of state we have all the information needed to *either*
> extract the syscall arguments *or* the register contents.  Without
> those, we can only represent one of the two possible metalevels (right
> now we represent the higher-level metalevel, the argument vector), but
> we need both for different usages.

My understanding of the problem is the following:

1. The SYSCALL 32-bit calling convention puts arg2 in ebp and arg6 on
the stack.

2. The int 0x80 convention is different: arg2 is in ecx.

3. We're worried that pt_regs-using compat syscalls might want the
regs to appear to match the actual arguments (why?)

4. ptrace expects the "registers" when SYSCALL happens to match the
int 0x80 convention. (This is, IMO, sick.)

5. Syscall restart with the SYSCALL instruction must switch to
userspace and back to the kernel for reasons I don't understand that
presumably involve signal delivery.

6. Existing ABI requires that the kernel not clobber syscall
arguments (except, of course, when ptrace or syscall restart
explicitly change those arguments).

So we're sort of screwed. arg2 must be in ecx to keep ptrace happy,
but the SYSCALL instruction itself clobbers ecx (the CPU stores the
return address there), so arg2 cannot be preserved.
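
To make the conflict in items 1, 2 and 4 concrete, here is a toy C
model. It is a sketch only, not kernel code: struct regs32 and all the
helper names are made up for illustration, and the register values in
any example are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 32-bit user register file; this is
 * NOT the kernel's struct pt_regs. */
struct regs32 {
    uint32_t eax, ebx, ecx, edx, esi, edi, ebp, eip;
};

/* int 0x80 convention (item 2): arg2 is read from ecx. */
static uint32_t int80_arg2(const struct regs32 *r)
{
    return r->ecx;
}

/* 32-bit SYSCALL convention (item 1): arg2 is read from ebp. */
static uint32_t syscall32_arg2(const struct regs32 *r)
{
    return r->ebp;
}

/* Model of what the SYSCALL instruction itself does to the user
 * registers: the CPU saves the address of the next instruction in
 * (r)cx. SYSCALL is two bytes long (0f 05). */
static void cpu_execute_syscall(struct regs32 *r)
{
    assert(r != NULL);
    r->ecx = r->eip + 2;
}
```

In this model, once cpu_execute_syscall() has run, int80_arg2() returns
the return address rather than the argument, while syscall32_arg2()
still returns the value the caller put in ebp. Whatever the kernel
later shows ptrace in the ecx slot, the original user ecx is gone.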

So here are three strawman ideas:

a) Change #4. Maybe it's too late to do this, though.

b) When SYSCALL happens, change RIP to point two bytes past an int
0x80 instruction in the vdso. Make the next instruction there be a
"ret" that returns to the instruction after the original syscall.
Patch the stack in the kernel.

c) Disable syscall restart when SYSCALL happens from somewhere outside
the vdso.
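
For what it's worth, the control flow I have in mind for (b) can be
sketched as a toy C model. This is an illustration under assumptions,
not a patch: struct user_ctx, the helper names and the addresses are
all hypothetical, and the user stack is modeled as a small array. It
relies on int 0x80 (cd 80) and SYSCALL (0f 05) both being two bytes
long, so the usual "back up by two bytes" restart lands on the int 0x80.

```c
#include <stdint.h>

#define INSN_LEN 2u /* length of both SYSCALL and int 0x80 */

/* Hypothetical model of the saved user context. */
struct user_ctx {
    uint32_t eip;      /* saved user instruction pointer */
    uint32_t stack[4]; /* toy user stack */
    int sp;            /* number of words currently pushed */
};

/* On SYSCALL entry: push the address following the original SYSCALL
 * ("patch the stack") and point eip two bytes past the vdso's
 * int 0x80, i.e. at the "ret" that follows it. */
static void enter_via_syscall(struct user_ctx *u, uint32_t syscall_addr,
                              uint32_t vdso_int80_addr)
{
    u->stack[u->sp++] = syscall_addr + INSN_LEN;
    u->eip = vdso_int80_addr + INSN_LEN;
}

/* Syscall restart: back up over the two-byte instruction, which now
 * means re-executing the int 0x80 in the vdso. */
static void restart_syscall(struct user_ctx *u)
{
    u->eip -= INSN_LEN;
}

/* The "ret" after int 0x80: pop the patched return address. */
static void execute_ret(struct user_ctx *u)
{
    u->eip = u->stack[--u->sp];
}
```

With a SYSCALL at 0x1000 and the vdso int 0x80 at 0xf000 (made-up
addresses), entry leaves eip at 0xf002, a restart moves it to 0xf000,
and the ret eventually resumes at 0x1002 either way.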

--Andy