Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re:[RFC] weird crap with vdso on uml/i386)

From: Linus Torvalds
Date: Tue Aug 23 2011 - 13:34:07 EST


On Tue, Aug 23, 2011 at 9:48 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
> Um...  How would it know which syscall variant had that been, to start
> with?

Just read the instruction, for chissake.

UML *already* does that, to see if it's "int80" or "sysenter" ('is_syscall()').
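
As a rough illustration of "just read the instruction" (a sketch only, not UML's actual is_syscall() code): assume the tracer has already fetched the two bytes at the address of the system call instruction from the tracee. How it locates that address is outside this sketch. The enum and the function name classify_syscall_insn() are hypothetical; the opcode bytes are the standard x86 encodings of "int $0x80", "sysenter" and "syscall".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum syscall_insn {
    INSN_NONE,      /* not a recognized system call instruction */
    INSN_INT80,     /* int $0x80 : CD 80 */
    INSN_SYSENTER,  /* sysenter  : 0F 34 */
    INSN_SYSCALL    /* syscall   : 0F 05 */
};

/*
 * Classify the instruction bytes a tracer has read from the tracee.
 * 'code' points at the bytes, 'len' is how many bytes are valid.
 */
static enum syscall_insn classify_syscall_insn(const uint8_t *code, size_t len)
{
    if (code == NULL || len < 2)
        return INSN_NONE;

    if (code[0] == 0xCD && code[1] == 0x80)
        return INSN_INT80;
    if (code[0] == 0x0F && code[1] == 0x34)
        return INSN_SYSENTER;
    if (code[0] == 0x0F && code[1] == 0x05)
        return INSN_SYSCALL;

    return INSN_NONE;
}
```

The point is only that the variant is recoverable from the instruction stream; the ptrace register interface does not have to carry it.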

Now, I do agree that if we had designed the ptrace interface with
these kinds of issues in mind, then we would have added a "state"
field to the thing that could have this kind of information as part of
the GETREGS interface. There is no question that that would have been
a good idea - but we have what we have.

I mean, technically, we could also have always just given "raw user
space register state" to ptrace, and then just said that "anybody who
traces system calls needs to know the exact calling conventions for
*that* kind of system call". But instead of that, we give the "cooked"
pt_regs values on read-out, to make it simpler for strace and friends.

And it's actually simpler for UML too. If we *didn't* give that cooked
register set information, then UML would *still* have to look at the
actual instruction in order to emulate the system call correctly
("it's sysenter, so now I need to take some of the system call
arguments from the stack"). So the fact that we do that register state
swizzling actually helps not just strace, but UML too.
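
To make the "raw" versus "cooked" difference concrete, here is a minimal sketch of what a tracer would have to do itself if it only saw raw user-space registers. It assumes the i386 vDSO sysenter convention, in which %ebp holds the user stack pointer at entry and the sixth argument sits in the word at 0(%ebp); for "int $0x80" the sixth argument is in %ebp directly. The struct layouts, the enum and canonical_args() are hypothetical, and the caller is assumed to have already read the word at the raw %ebp address from the tracee.

```c
#include <stdint.h>

/* Hypothetical raw i386 register snapshot, as user space left it. */
struct raw_regs {
    uint32_t eax, ebx, ecx, edx, esi, edi, ebp;
};

/* Hypothetical canonical view: syscall number plus six arguments. */
struct syscall_args {
    uint32_t nr;
    uint32_t arg[6];
};

enum entry_kind { ENTRY_INT80, ENTRY_SYSENTER };

/*
 * Build the canonical argument list from raw registers.
 * 'word_at_ebp' is the 32-bit word the caller read from the tracee's
 * memory at address r->ebp; it is only used for the sysenter case.
 */
static struct syscall_args canonical_args(const struct raw_regs *r,
                                          enum entry_kind kind,
                                          uint32_t word_at_ebp)
{
    struct syscall_args a;

    a.nr = r->eax;
    a.arg[0] = r->ebx;
    a.arg[1] = r->ecx;
    a.arg[2] = r->edx;
    a.arg[3] = r->esi;
    a.arg[4] = r->edi;
    /* sysenter: arg 6 lives on the user stack; int80: it is in %ebp. */
    a.arg[5] = (kind == ENTRY_SYSENTER) ? word_at_ebp : r->ebp;
    return a;
}
```

With the cooked pt_regs that GETREGS hands out, the tracer gets the equivalent of the returned structure directly and needs none of this on the read side.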

It would be *nice* if we did the swizzling automatically at setregs()
time too, but we simply don't have enough information in the kernel to
do that. Again, exactly because pt_regs doesn't have a "state"
variable, when user-space does the SETREGS call, we simply don't know
whether we are in "normal" code or in some system call entry or exit
state. So the kernel does the swizzling at GETREGS time (by virtue of
always having the registers in a "canonical" state for system call
entry), but we fundamentally *cannot* do the unswizzle, because we
don't know what the SETREGS caller actually did.

So I think the current state is actually the best we could possibly
do, with the caveat that *if* we had known about the "different system
calls have different register layouts" originally and had thought of
it, we could have added a 'state' word that the kernel could set at
GETREGS time, and use at SETREGS time to decide whether swizzling is
needed or not.

But not only would that have required time travel (ptrace existed
before the multiple system call variants did), even then it's not 100%
clear that the current simpler model (with the admittedly subtle case of
implicit state and its effect on register state) isn't actually the
better solution. *Somebody* has to do the register swizzling, and the
current "kernel canonicalizes registers at read time, you need to
swizzle them if you change state" may simply be the RightThing(tm).

Linus