Re: Finally the 2.2.0 is out of sight :-).

Malcolm Beattie (mbeattie@sable.ox.ac.uk)
Wed, 27 Jan 1999 11:38:52 +0000 (GMT)


Marcin Dalecki writes:
> Oliver Xymoron wrote:
> >
> > On Wed, 27 Jan 1999, Marcin Dalecki wrote:
> >
> > > What I'm thinking about is porting the kernel to run
> > > as a *compleatly* *normal* user level process. Something along those
> > > lines had been already imeplemnted by prof. Switzer at Univeristy of
> > > Goettingen.
> > > What he had done was in fact a pseudo micro kernel implemented
> > > as a UNIX programm running on a file containing his file system.
> > > It was called TUNIX. (ftp to ftp.gwdg.de will
> > > show where to find it.)
> >
> > I suggested something similar a while back but I realized it had
> > conceptual problems. Presumably your user-mode kernel would want to run
> > binaries for the same arch it was built for, and have a way to reroute
> > syscalls to the pseudo-kernel rather than the real kernel. That in itself
>
> I wouldn't try to reroute the system calls but rather write trivial
> drivers
> using a file for example as the "hardware" for the root partition. This
> wouldn't
> be that ugly.

The timing of this is an interesting coincidence. I actually started
thinking about porting Linux to a "uvmi386 architecture" ("User-mode
Virtual Machine") a couple of weeks ago and started hacking last
weekend. I was going to wait until I had something working at least
minimally before announcing it but since it's being discussed now,
I'll give a brief summary of what I'm doing.

MM is handled by carving out a chunk at the top of user address space
(I was going to use 0xB0000000 until I saw that PAGE_OFFSET was broken
for that so I guess I'll have to use 0x80000000). Then mmap() a file
"physmem" (MAP_SHARED) to represent physical memory there. The kernel
executable itself can go above that. For a given process, use mmap()
to map each page of the physmem area down into the lower part of the
address space. I've checked that this behaves as expected. Change
set_pte to actually mmap the page but we can carry on using the i386
page table layout without actually using the privileged page directory
register. Note that this slow, but I never claimed otherwise. The nice
thing is that the architecture-specific part of the port which is
normally hard-to-understand assembly and low-level hackery for a given
platform is, for uvm, plain libc-style C with mmap(), sigsuspend(),
kill() and so on. That makes it good a teaching tool for porting
Linux. And it doesn't require any kernel changes to the hosting kernel.
Now that only copes with one task address space: I'll come back to
multiple tasks after I deal with syscalls and timers.

For syscalls and timers, there needs to be a small patch to the
hosting kernel. A process can set a flag in its task_struct which
makes any "int $0x80" do something different. A small check in
ENTRY(system_call) in entry.S tests the flag: if it's set then
instead of doing the system call, it generates a signal to the
process, SIGSYS say (which can be an alias for ordinary signal if so
desired), resets the "syscalls allowed" flag in the task_struct and
one other thing I'll mention in a minute. The uvm kernel uses
sigaltstack() (only in kernel 2.2 and not in 2.0 unfortunately) to get
a separate kernel stack. A task running in the uvm kernel in user-mode
runs in the lower part of the address space with each page mmap()d to
part of physmem. When it does a syscall (i.e. ordinary Linux i386
binaries will run fine modulo a lower address space ceiling), it does
an "int $0x80" which, because of the task_struct flag, raises a
signal, restores the "can do syscalls" flag, mprotect()s the address
space from PAGE_OFFSET upwards so it's accessible again and the uvm
kernel gets control as *its* syscall entry in the signal handler.
For ret_from_syscall, it returns from the signal handler. The
sigreturn() then sets the "raise signal on int $0x80" task flag again,
mprotects the PAGE_OFFSET-end address space (so the user-mode task
can't access uvm kernel space) and returns back to the instruction
following the int trap.

The jiffy timer works similarly: it's simply a setitimer().

Back to mm. Multiple tasks need different page directories. The
solution to this that I came up with is a bit ugly (and slow) but
should work. Each task running under the uvm kernel gets a real
task in the hosting kernel. Ie, the uvm kernel implements fork()
with a real fork() just to get a separate address space. The real
processes running vmlinux block in sigsuspend() until woken up.
switch_to() is implemented by looking up the underlying PID of the
task, sending it a signal (SIGUSR1 or similar) and then doing a
sigsuspend() waiting for a similar signal to occur when context
switches back to the original task.

save_flags(flags) is ((flags) = __global_flags),
cli() becomes
__global_flags = 1; sigprocmask(SIG_SETMASK, &cli_set, 0);
and restore_flags(flags) becomes
sigprocmask(SIG_SETMASK, flags ? &cli_set : &sti_set, 0);
__global_flags = flags;

Access to the outside world (the hosting kernel) shouldn't be too
hard. A block driver which uses lseek()/read()/write() on an
underlying real file to get its blocks. A network driver can use
a UNIX domain socket to talk to processes in the hosting
environment. Even without a network driver, a tty-like driver may
be easier and ppp run on both sides would similarly mean that
filestore could be shared between hosting environment and uvm
kernel via NFS.

Current status is that I've copied the i386 includes and arch tree
to uvm versions, removed unneeded stuff from its config file
(no DMA, no PCI, no ISA, ...) and hacked everything around such
that everything builds but with some stubs and missing symbols for
things I haven't done yet. It's not working yet and it'll depend on
how many holes appear in my kernel knowledge as to how long it takes.
I may have missed something really obvious that breaks everything but
I suspect that something like it should work. My intention isn't
really a viable well performing Linux under Linux vm but is mainly
just (1) a cool hack, (2) a good teaching aid, (3) possibly useful
for trying some driver code without crashing a real system, (4)
possibly testing SMP code when you don't have an SMP machine (to
pick up races etc.) and (5) did I mention it's a cool hack?

--Malcolm

-- 
Malcolm Beattie <mbeattie@sable.ox.ac.uk>
Unix Systems Programmer
Oxford University Computing Services

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/