Re: 2.3.30pre1 syscall w/6 args support?

Artur Skawina (skawina@geocities.com)
Wed, 08 Dec 1999 04:13:54 +0100


Linus Torvalds wrote:
>
> > Is this really absolutely necessary? I have SYSENTER based syscalls
> > mostly working, cutting the cost of a system call in half (currently
> > from 281 cycles for a int80 based getpid to only 137, but the numbers
> > will change a little as i finish it off). Supporting six args is not
> > really an option however (not w/o extra kernel side translations --
> > there are simply no registers left to pass the required info
> > (esp/eip))..
>
> Actually, if you require passing of eip, you're almost certainly doing
> something wrong.

well, while trying to get it to work i've often wondered what the
person who came up with the sysenter/sysexit scheme at intel had been
sniffing... [for people who might not be familiar with sysenter:
passing of eip/esp is necessary because sysenter doesn't save it, this is
true always, even with Linus' scheme; the only choice we have is _how_
to do that, and there's no perfect solution]

> The thing about eip is that you can (and should) be able to run the same
> library and binaries on both a kernel with and without SYSENTER. Without
> having to test for SYSENTER in user mode and do different library things.

the choice becomes: do you want the libc to do this (presumably just one
test on bootup which selects the proper stubs to use) or do you want the
kernel to do very much the same (by providing the stubs (or as you
suggested some kind of intermidiate single C entry point, which isn't a
good solution either).

> And THAT in turn means that to do SYSENTER right, the kernel should map a
> magic page into each process' address space (use the FIXMAP capability -
> we're already using that to map things like the IOAPIC at a fixed
> address), and you do a system call simply by doing a "call" to that magic
> page: the kernel can set up whatever it wants (be it "int 0x80" for
> old-style system calls, SYSENTER for new system calls, or some magic trap
> for Merced - you get the idea) at that address.

I did consider this. The reasons i didn't do it that way were (1) to
make it as nonitrusive as possible (the patch currently is just a few
dozen lines) and (2) policy -- adding hardwired userspace mappings.
(the performance difference would be ~ a few cycles, so that's not a
big issue). Now, if (1) isn't a problem i may consider that scheme again.
Assuming nobody sees any problems w/ (2) above?

> And that in turn means that you don't need to save the return EIP for
> SYSENTER, because the only way to get the EIP anyway is just from the
> stack that was required for the call to the magic address - the actual
> address of the "sysenter" instruction is not even interesting.

Until now i've actually ignored all 5 arg syscalls, as the int80
entry point is always there and the gain is smaller the more the
complex the syscall is; so i didn't have to do even that. But i had
planned on changing that -- this is one reason i said the numbers
given above are going to change a bit.

> In fact, I would argue that the proper way to handle this is:
>
> - no sysenter capability on the CPU: use "int 0x80":
>
> magic_address:
> movl 4(%esp),%ebx
> movl 8(%esp),%ecx
> movl 12(%esp),%edx
> movl 16(%esp),%esi
> movl 20(%esp),%edi
> movl 24(%esp),%ebp
> int $0x80
> ret
>
> - sysenter:
>
> magic_address:
> movl %esp,%ebx
> sysenter
>
> and in both cases the magic address is just called with:
>
> - arguments on the stack in user mode
> - %eax contains the system call number
>
> and in the sysenter case you'll just end up having to fetch the arguments
> from the stack (you need to touch the stack anyway in order to get the
> return address).
>
> At least with the above, we can handle _any_ future calling convention.

Yep. In theory this seems perfect. Other people have already pointed out some
obvious issues, there are probably more; most are likely solvable, but could
add significantly to the complexity. [there are some less obvious issues like
you can not, for various reasons, blindly use the sysenter/sysexit pair
(eg you can not return with sysexit from execve) and in some cases i found
int80 to be faster (eg signal handling, when restarting syscalls)]

> Anybody who thinks that having two different libc's on different machines
> is acceptable is so out of touch with reality that it is scary.

This has been true for quite a while, having an optimized libc for
your architecture makes sense. If you need to share libraries you
just pick the least common demoninator (and pay the price in performance,
but in reality that's usually very low).

Not that i think that this is relevant in this case -- the question is:
is there anything that the kernel can do that could not be done equally
well in libc? Why shouldn't libc be setting up that special magic entry
point all by itself?

[i'm obviously a bit biased since i already have sysenter working w/o
the kernel mapped stubs, but wouldn't have a problem switching to
another solution, if i'd be convinced it would be beneficial.
SYSENTER/EXIT has it's problems, but the numbers speak for themselfes.
A few examples:
INT80 SYSENTER
lat_syscall null 0.6814 0.3394
lat_syscall read 1.0477 0.6747
lat_syscall write 0.8422 0.4937
lat_fifo 7.4193 6.3050
lat_pipe 7.4369 6.3880
lat_unix 14.9843 14.1488

time dd if=/dev/zero of=/dev/null bs=1 count=10000000
0:20.21 0:13.00
]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/