Re: cluster thoughts

Albert D. Cahalan (acahalan@cs.uml.edu)
Sun, 14 Jun 1998 00:48:43 -0400 (EDT)


Larry McVoy writes:

> Moving on to a more positive note, I personally hate it when grey
> haired architects tell me my ideas are stupid but don't tell me the
> right answer. So here's a little sketch of what I think is the right
> direction for clustering.

These guidelines are great for how you would use clustering in
your environment, but they won't fit all other environments.
This "right answer" is specific to you.

> Things which are bad:
>
> . Distributed shared memory. It has no failure model - what do you do
> when you page fault on a page that is no longer there because that node
> crashed? There is no solution, you simply can't allow that to happen,
> which means no distributed shared memory.

If you want to build a reliable system from scratch, yes.
Not everybody wants to do that. Some people have existing code
that they want to port with minimal effort. It might be just
number crunching, not financial transactions.

> . Remote procedure calls. RPCs block. It's a nice textbook way to tell
> people to think, but it is horribly slow in practice. Use messages,
> transaction IDs and queues instead. A single process can handle thousands
> or tens of thousands of messages per second.

Being "horribly slow" matters to you, but not to everyone. Sometimes
the RPC overhead is insignificant when compared to the work being done
by each call. Many people have existing software that they don't wish
to rewrite.
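
For what it's worth, the model is easy to show in userspace. Here is
a minimal sketch of a queue of outstanding requests keyed by
transaction ID; every name and the table layout are invented, and a
real version would sit on top of sockets:

/*
 * Requests carry a transaction ID and return immediately; a single
 * dispatch loop matches replies against the table of outstanding
 * requests, so replies may arrive in any order.
 */
#include <stdio.h>
#include <string.h>

#define MAX_OUTSTANDING 1024

struct message {
	unsigned long txn_id;		/* matches reply to request */
	int opcode;
	char payload[64];
};

static struct message outstanding[MAX_OUTSTANDING];
static unsigned long next_txn;

/* Queue a request and return at once; nothing blocks as with RPC. */
static unsigned long send_request(int opcode, const char *data)
{
	unsigned long id = ++next_txn;
	struct message *m = &outstanding[id % MAX_OUTSTANDING];

	m->txn_id = id;
	m->opcode = opcode;
	strncpy(m->payload, data, sizeof(m->payload) - 1);
	/* ... hand m to the transport here ... */
	return id;
}

/* Called from the event loop whenever a reply comes in. */
static void handle_reply(const struct message *reply)
{
	struct message *req = &outstanding[reply->txn_id % MAX_OUTSTANDING];

	if (req->txn_id != reply->txn_id) {
		printf("stale or unknown txn %lu\n", reply->txn_id);
		return;
	}
	printf("txn %lu (op %d) completed\n", reply->txn_id, req->opcode);
	req->txn_id = 0;		/* slot is free again */
}

int main(void)
{
	struct message fake_reply = { 0 };

	fake_reply.txn_id = send_request(1, "read block 42");
	send_request(2, "read block 43");  /* issued before reply #1 */
	handle_reply(&fake_reply);
	return 0;
}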

> . Process migration. Sounds good, but is way too costly to be of any
> use. Good explanations of why may be found in any paper on scheduling
> algorithms and their effect on caches. Most of these papers end up
> talking about "cache affinity" which is a fancy way of saying "run
> once on CPU 3, run always on CPU 3". In order to see the relationship
> to process migration, view the scheduler as the same as the migrator,
> and the processor cache the same as the page cache, and everything
> else is the same.

It looks to me like you gave a reason _for_ process migration.
Linux does move processes from one CPU to another. Going from
one cluster node to another is indeed the same idea. Migration
is more powerful _and_ more expensive, so the node affinity must
be stronger than processor affinity. For example, migration could
be forced to wait 20 seconds.
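
A sketch of that decision, with invented names and constants (the
20 seconds is just my example above); the node affinity shows up as
a minimum residency time that the CPU scheduler does not need:

#include <stdio.h>
#include <time.h>

#define MIN_NODE_RESIDENCY 20	/* seconds on a node before moving */
#define IMBALANCE_THRESHOLD 2	/* runnable-process difference */

struct cproc {
	time_t arrived_on_node;	/* when it last migrated */
	int node;
};

static int should_migrate(const struct cproc *p, int load_here,
			  int load_there)
{
	time_t now = time(NULL);

	/* Node affinity: refuse to move a recent arrival. */
	if (now - p->arrived_on_node < MIN_NODE_RESIDENCY)
		return 0;

	/* Move only when the imbalance justifies the copy cost. */
	return (load_here - load_there) >= IMBALANCE_THRESHOLD;
}

int main(void)
{
	struct cproc p = { time(NULL) - 30, 0 };

	/* Long resident and a big imbalance: worth moving. */
	printf("migrate? %d\n", should_migrate(&p, 8, 3));
	return 0;
}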

> . Dynamic load balancing across machines. This is only an option if
> you have process migration, which we've ruled out. We accomplish
> load balancing statically, see below.

Dynamic load balancing must make some assumptions about the future,
but this problem is also faced by the scheduler and swapper.
It is just an imperfection that we must live with.

> . Remote fork(). It's a nice idea, but it doesn't work well in practice.

Sure, dynamic migration eliminates the need: fork locally, then
migrate the child to wherever it ought to run.

> Things which are good:
>
> . Local and global names.

I think that could be a mess. Think of the whole cluster as
a NUMA machine with a single system image.

> . Messages and message queues. Messages are nice because they don't
> block. A great programming model is to have a queue of outstanding

That is great for _new_ code.

> . Remote exec. You have to have some way of fanning out load
> across the system. You want to have a REMOTE_ON_EXEC flag which
> tells the OS that it is OK to consider this process for remote
> exec when doing a regular exec(). The flag gets cleared once the
> process execs; it has to be manually reset. This lets you do stuff
> like have login & make and other similar process spawners fan out
> the load without worrying about the subprocesses being fanned out.

It is better to have hints for both ways, and to let the OS ignore
the hints as needed.
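
Something along these lines is what I mean; the hint values and the
whole interface are hypothetical, nothing here is a real kernel API:

#include <stdio.h>

#define HINT_NONE	0
#define HINT_REMOTE_OK	1	/* spawner: OK to exec remotely */
#define HINT_STAY_LOCAL	2	/* spawner: prefer this node */

/* Pick a node at exec() time; the hint is advisory, not binding. */
static int pick_exec_node(int hint, int local_node, int best_node,
			  int local_load, int best_load)
{
	if (hint == HINT_STAY_LOCAL)
		return local_node;
	if (hint == HINT_REMOTE_OK && best_load < local_load)
		return best_node;
	/* No hint, or the hint makes no sense right now: ignore it. */
	return local_node;
}

int main(void)
{
	/* A make-like spawner marks its children remote-eligible... */
	printf("child on node %d\n", pick_exec_node(HINT_REMOTE_OK, 0, 3, 10, 2));
	/* ...but the OS keeps them local when remote is no better. */
	printf("child on node %d\n", pick_exec_node(HINT_REMOTE_OK, 0, 3, 2, 10));
	return 0;
}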

> . Cache coherent filesystems. NFS is a fine solution for a campus
> but it sucks for a cluster. The lack of coherency makes it almost

Perhaps it is better to share the page cache. When a filesystem is
mounted, it is mounted by all nodes at once. The actual filesystem
type doesn't matter. An external NFS server could see the whole
cluster as one machine that only mounts the filesystem once.
Nodes within the cluster can use a read-write locking scheme to
share the page cache without data corruption.
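
Roughly like this; a pthread rwlock stands in for whatever
cluster-wide lock the nodes would really use, and the structures and
names are invented:

#include <pthread.h>
#include <string.h>

#define PAGE_BYTES 4096

struct cluster_page {
	pthread_rwlock_t lock;	/* many readers, one writer */
	char data[PAGE_BYTES];
};

/* Any node may read a page concurrently with other readers. */
static void page_read(struct cluster_page *pg, char *buf)
{
	pthread_rwlock_rdlock(&pg->lock);
	memcpy(buf, pg->data, PAGE_BYTES);
	pthread_rwlock_unlock(&pg->lock);
}

/* A writer holds the lock exclusively, so no node sees a torn page. */
static void page_write(struct cluster_page *pg, const char *buf)
{
	pthread_rwlock_wrlock(&pg->lock);
	memcpy(pg->data, buf, PAGE_BYTES);
	pthread_rwlock_unlock(&pg->lock);
}

int main(void)
{
	static struct cluster_page pg = { PTHREAD_RWLOCK_INITIALIZER };
	char buf[PAGE_BYTES] = "hello";

	page_write(&pg, buf);
	page_read(&pg, buf);
	return 0;
}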

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu