Let's get this right (WAS RE: MOSIX and kernel mods)

Sean Hunter (sean@uncarved.co.uk)
Sat, 6 Mar 1999 10:26:02 +0000

We all need to put the guns down and think a bit.

Moving to a distributed approach to computing is a fundamental change,
and if we make the wrong design and architecture decisions now, we
will be stuck with them for a while. If commercial Linux apps come to
rely on a distributed API we provide, we had better be sure that it
doesn't suck, because as the user- and software base for Linux gets
bigger and bigger, these things get harder and harder and more and
more painful to put right. Thinking "Linus will make the right
decision" is a cop-out. We need to think this through ourselves.

To summarise, two lines of argument have emerged so far in this thread:

1)"DSM sucks because it encourages bad code. It shouldn't go in"

2)"DSM is an attractive abstraction because it extends stuff we use
already on a single node to work on the network. It can suck, but not
too badly. It should go in."

For what it's worth, I am firmly in the former camp. Linux as an OS is
all about excellence. It doesn't have to cater to commercial
considerations about time(or cost)-to-market because of the way in
which development is done. It doesn't have to provide a second-best
solution because that's all "the development team" has been able to
come up with. It doesn't have to put something into the kernel just
because it makes life easier for some people in the short term, and we
all know that we can't afford to accept sloppiness just because people
say "Well, it's a compile option, you can just leave it out."

We need a solution that we can live with for the next ten years, not
one that works now but that we then have to ditch at the next major
release with much gnashing of teeth.

I'd like to ask (and then discuss) some questions:

1)Is distributed IPC a useful abstraction?
2)Is distributed shared memory a necessary part of that abstraction?
3)Does this of necessity _have_ to be part of the kernel?

Question 1. This is about "why do we want to do this?" Are we trying
to do it because:

a)it sounds hip in a press release?
b)it'll get me my phd?
c)I'll have a bit of my own code in the kernel, girls will think I'm
attractive etc etc
d)There are real benefits to real applications.

Why might we want to go distributed? The obvious answers are:
a)speed - My job is inherently parallel and I just can't get enough
raw processing on a single node.
b)use of resources - I've got all these computers sitting idle when
they could be predicting the stock market for me.

Now, if we want to go distributed to make things run faster, then we'd
better be sure that our task is suited to it, that our program is
well-written and our API is fast, otherwise we have a performance loss
when we wanted a performance gain.

If you just want to use up spare cycles on boxes on the network, and
don't care whether or not it's fast, ask yourself "Does this really
call for an operating-system change, or could I not just, say, 'spend'
those cycles on a quake server?" 8^)

Question 2: This is about "Do our apps need to know about the
network?". Your answer to this will depend on your answer to the
first question. If you want your app to be fast and distributed, you
probably want an abstraction that helps you to write a fast,
distributed app.

Question 3: This is two parts: "Will this tie our hands in the
future?", and its corollary: "Will this benefit us so much that we
will still be prepared to support it when we aren't writing our phd?"
I don't mean to sound cynical, but everyone who has worked in a large
commercial software environment knows that the cost of a bad decision
in software architecture and design tends to snowball. This is
because as you have more and more takeup for your idea, the cost of
making corrections "downstream" becomes greater and greater. If we
think something is a little unsightly now, once some big software
house writes _the_ Linux killer app using this feature, our design
flaw quickly escalates from "lame duck" through "albatross" to
fully-fledged "millstone round our neck" status.

Finally, if you don't read anything else I have written, read this.
Ask yourself "Is this the best we can do?". Why should we settle for
anything less than the best solution? We have some of the best
programmers ever. We have some of the finest design-heads ever. We
have the largest body of developers/testers ever. We have Linus. If
distributed programming gets people excited, why don't they write an
API that we can all get excited about? An API that won't have us 5
years down the line having to say "well, we'd love to do that,
but...". This isn't like writing a device driver. This is a major
facility which software writers will use if it's provided. If it's
done right, it could provide tremendous benefits. If done wrong,
we'll all suffer the consequences.

Sean Hunter
