Re: Inter-Kernel Communications (Multi Kernel Clusters)

Keith Rohrer (
Mon, 24 Feb 1997 04:02:00 -0600 (CST)

> >** network memory space **
> >1. Upon startup of the first kernel on the network it allocates network
> >address space for its physical and swap space, I.e. a machine with 64MB
> >ram and 36MB swap would (by udp broadcast etc.) decide to map it's 100mb
> >of physical (existent) memory space to the first 100MB of virtual
I'd reserve the first chunk for the coordinator, accesses to which would
be intercepted by whatever machine happens to be coordinating at the time.
Beyond that, presumably either there's zero live members of the cluster,
or your "udp broadcast etc." means ask the coordinator for a hunk of space.

> >first effect is we end up with what I will call NETWORK MEMORY SPACE, If
> >there are three machines with 100MB each of memory, then each machine will
> >be able to address 300MB of memory (100MB of it local the rest remote),
I don't think we want to care how big the total NMS is, or whether or not
it's contiguous, since this simplifies nothing and makes fault masking
that much harder. All we want to care about is what addresses have meaning
right now.

> >but this would all operate synonymous to local memory from a user
> >standpoint.
Oh, goodie. Everyone who can forge a request can read and write my
whole core. I hope each local machine retains some access control...

> >This is the first step in mounting kernels together in a
> >CLUSTER. Of course some type of memory locking would be necessary to make
> >this work.
So you're supposing a NOW-based shared memory multiprocessor. Last I
recall the wisdom of the bleeding edge was that shared memory was
"easier" to program with than message-passing, but message passing
was easier and faster to do well.

My intuition is that, if you want a shared memory machine separated
in places by low-speed (i.e. 100 Mbps Ethernet or slower :-)) links,
you'll have to do serious cacheing...resulting in a variety of consistency
protocols and expire-testing frequencies (on a per-page basis, of course,
and someone has to either set the right protocol or write the code to
figure out the correct protocol on the fly...and sometimes the correct
protocol will be "migrate the page", which implies that sharing out
all your actual memory like this would be unwise) that will be more
of a nightmare than this sentence.

Shared memory is a mechanism which can be used to implement stuff; but
I don't think we want to bind ourselves to always using that, or even
implementing it in the first place...we've already got well-tested
message passing setups (UDP primarily; fault masking or server
redundancy where the destination address of a service is not always
one IP number isn't TCP's strong suit) we can build on, and all the
existing UNIX software which reaches in the direction of a distributed
system is message-based.

> The problem with this is that you can have 2 machines start at the same time when the network is down. They both use the same address space and stuff everything up.
This is a well-understood sort of problem; distributed election algorithms
and such are in modern OS textbooks, along with the admission that
partitioning sucks. How to recover from an unplanned partitioning of the
network is a current research problem...

> >2. Each kernel on this three kernel network would deal directly only with
> >its physical memory and devices and processes (which is really all I can
> >this of that the kernel primarily does). So then, very much like the
> >NETWORK MEMORY SPACE we would also need to create a NETWORK DEVICE SPACE,
> >so that every machine on the network could see every other machines
> >devices.
What you really want to export to the whole cluster is the same interface
that the kernel already provides to user-level programs, not raw device
I/O. Further, you want to export more generalized services, including
filesystems; plain devices can be special cases of services. You really
need a layer of sanity checking between the network and your hardware
(or anything else that could make you do things; I don't want someone
forging messages by writing to my ethernet card from a remote machine,
or reading files by reading blocks from my disk and claiming to have
privs). No machine should need low-level drivers for devices not
physically attached; all we'd have to re-implement is one low-level
driver for each kind of service we wanted the kernel to provide (i.e.
we'd have a "neo-NFS" driver and "generic remote" char and maybe block

> > What this really means also is one big network file space. Three
> >machines with 3 drives all on one directory tree --- perhaps even with
> >some kind of mirroring for redundancy. This really however is not the
> >result of some half-ass solution like nfs (I am talking about nfs here not
> >the people who created it, no offense).
> The mirroring is already written. It should be released in a few weeks. I won't mention the name of the person who's about to release it as he's very busy and doesn't need any more Email.
The service-specific code will have to provide a way to keep all providers
of a service in sync (as needed) and resync them (as needed, as best as
possible); the generic service code should provide for requests to auto-
magically go to "any available provider" of the given service (ideally
with load balancing or at least local defaults). A group of servers
presenting that mirrored filesystem to the cluster would register the
service with each of them as alternate providers, and funnel any
requests any of them received to the existing mirroring code (which
in turn would fall straight through to the filesystem code unless a
consistency check failed).

> >** network process space **
> >3. Very much like the above there would also be a NETWORK PROCESS SPACE,
> >where the heavily loaded machines could offload processes onto idle or
> >lightly loaded machines.
The above seems to be intended to make process migration easy; what you
propose here is the actual migration. How to migrate a process will need
writing into the kernel (at the lowest level), but when and why (and to
where) will be user-space (and often user-made) decisions. Currently,
a process has its own set of:

CPU registers, including the PC
migration to a different architecture will fail
virtual memory and memory mapping
migration to a machine without sufficient memory will fail
shared libraries linked in
migration to a machine with different library revs fails
(different revs assumed incompatible)
"open" services and served files
no migration-related problems here
user-level threads are no problem, kernel-level threads
will *usually* want to all migrate together
open files and devices
these must be converted into service-based access (offline
or at migration time)
miscellaneous kernel state, and everything else I've forgotten
this will need to be massaged out of existence

Any machine specific resources (fd's, etc.) must be handled in some way
before migration will work; using a "service" front-end for everything
a machine exports will handle this gracefully, except for truly local,
non-exported resources. In the latter case, the process is either
providing a service related to this machine's hardware, from user space,
or it's stubborn enough that we don't need to make migration work.
The "miscellaneous" entry should be empty, so long as a process itself
never gets suspended when it's actually running in kernel space; I hope
this is already true anyway.

The important supports we need for migration, then, are:
a service mechanism with support for direct access as usual by
local processes
service/local transparency in the kernel or libraries
kernel-provided threads that are recognizably all one process
a kernel mechanism for migration, and (user-space) programs to
do the work
I'm sure I'm missing something here...

> Why not have compute servers, disk servers, etc like Amoeba (it's something to consider at least)?
De facto, sure; certainly you'd want to be able start pov-ray on a compute
server, tar on an appropriate fileserver, but have no danger of doom starting
up elsewhere. But migration should either be done by hand or be very
configurable when done automatically, as a local CPU with enough RAM seems
rarely compute-bound under Linux...

> To do this properly you would need to have /dev/* refer to devices on the machine the process started on, and have /localdev/* have the files that are usually in /dev, and have /remotedev/machinename/* refer to remote devices. /dev/* could be some procfs like file system to support this.
Making the process "open" services rather than device files fixes this whole
problem: if the service is truly machine-specific then a single machine
will be the only provider; if the service is replicated, then requests
will go to the nearest (or most lightly loaded) provider of that service;
and if the service is truly generic (e.g. /dev/zero) then either all
machines will provide the service locally.

> >I think it would be really daring for some of us kernel hackers to try
> >building a facility like this --- certainly we would be leaving every
> >other modern operating system with the taste of dust on their tongues and
> >it could be done in such a way as to appear no different from the user
> >side --- the three machines become like one machine.
> Yes.
Again, since most people have only one machine, and most of the work done
on most machines will be processes that never have and never will migrate,
we do have to be careful.

> >** the shortcomings **
> >This solution would make muliple networked machines like one with more
> >memory, devices, and processors --- it still falls short of bonding
> >together three processors to look like one really fast one, however you
> >wouldn't want to do that over a network anyway unless it was a very fast
> >dedicated fiber optic one.
SMP is already underutilized; to make a program run faster with SMP you
need kernel-level threads running on different CPUs. To allow a process
to run on more than one machine at once, we will first of all run into all
the cache consistency and performance nightmares I'd hoped to avoid with the
service-based scheme; in fact, the shared-memory approach may be the only
way to do this. Forking children and communicating via an explicit block
of shared memory seems ever so much nicer. Fork creates some (solvable)
issues about routing service requests for inherited services (like
open files); separable threads would require all that data to be kept
somewhere on a per-thread basis (harder and messier, but doable). The
important thing is to get the routing of responses right, without fork
itself or any data in memory shared among threads knowing about it.

> > In my scenario, cpu hungry processes would
> >automatically migrate to the most powerful system on the network.
I'd want something a little more intelligent (e.g. load-balancing, esp.
if the most powerful system isn't most powerful by much).

> >This
> >solution also assumes an environment where the like between the computers
> >is not a security problem..
> If all machines have the same root password and use SSL encryption to transfer data then this shouldn't be a problem.
The shared memory model would virtually require a physically secure,
properly-firewalled physical subnet. The service model's local users
will make the same system calls they always have (or be treated like
remote users), and the remote users can be authenticated any way the
implementors please.

> >Feedback would be much appreciated, I would especially like to hear from
> >anyone out there who is interested in such a thing. On major issue would
> >be device name space as major and minor numbers are assumed to be
> >relatively static.
I don't like the idea of nakedly sharing a whole machine with J. Random
AOLuser. However, the majority of the filesystem namespace would have
to be standardized across the cluster for migration to be useful, and
you'd need to maintain a clusterwide service directory and PID-space
(this makes have-I-forked an easy check, and if your service-user
number is your cluster PID for all services, cleaning up after you is
relatively easy).

> /remotedev/machine/* would just be named pipes or something (not real devices).
> /localdev/* would be normal device files
> /dev/* would be sym-links.
> I don't see the problem with this.
> If you're serious about this then the thing to do is write a web page, arrange a mailing list, and make an entry in the LPM. If you have problems with any of these then let me know and I'll help.
I would definitely suggest researching the existing distributed operating
systems before attempting such a project. I think the service-based
approach is the most UNIX-compatible, though it would probably be much
easier to implement around some flavor of MACH than on top of Linux. On
the other hand, there are several operating systems being developed by
researchers where absolutely everything fits into a single (64-bit, if
you're sane; everything includes every byte on everyone's disks!) address
space. Look at what's been done and what's out there, then work towards
an achievable short term goal using an implementation that won't hamper
your future plans. And either keep your starting point in mind (or that
first step won't be short enough to ever finish), or pick a starting
point from among the operating systems that are already out there...