Re: TOE brain dump

From: Eric W. Biederman (ebiederm@xmission.com)
Date: Tue Aug 05 2003 - 12:19:09 EST


Werner Almesberger <werner@almesberger.net> writes:

> Eric W. Biederman wrote:
> > The optimized for low latency cases seem to have a strong
> > market in clusters.
>
> Clusters have captive, no, _desperate_ customers ;-) And it
> seems that people are just as happy putting MPI as their
> transport on top of all those link-layer technologies.

MPI is not a transport. It is an interface, like the Berkeley sockets
layer. The semantics it wants right now are usually mapped to
TCP/IP when used on an IP network, though I suspect SCTP might
be a better fit.

But right now nothing in the IP stack is a particularly good fit.
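
To make "mapped to TCP/IP" concrete, here is a crude sketch of what an
MPI-style blocking send boils down to on an IP network today: a tagged,
length-prefixed write on a TCP socket that is already connected to the
peer rank. The helper name is made up; this is not code from any real
MPI implementation.

/* Hypothetical sketch only.  sock is assumed to be a TCP socket
 * already connected to the peer rank.
 */
#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>

static int sketch_mpi_send(int sock, const void *buf, uint32_t len,
			   uint32_t tag)
{
	uint32_t hdr[2] = { htonl(tag), htonl(len) };
	ssize_t n;

	/* Envelope first, payload second; the receiver matches on the
	 * tag much as MPI_Recv matches on (source, tag).
	 */
	if (write(sock, hdr, sizeof(hdr)) != (ssize_t)sizeof(hdr))
		return -1;
	for (uint32_t off = 0; off < len; off += n) {
		n = write(sock, (const char *)buf + off, len - off);
		if (n <= 0)
			return -1;
	}
	return 0;
}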

Right now there is a very strong feeling among most of the people
using and developing on clusters that by and large what they are doing
is not of interest to the general kernel community, and so has no
chance of going in. So you see hack piled on top of hack piled on
top of hack.

Mostly I think that is less true than they believe, at least if they
can stand the process of severe code review and clean up their code.
If we can put in code to scale the kernel to 64 processors, NIC
drivers for fast interconnects and a few similar tweaks can't hurt
either.

But of course to get through the peer review process people need
to understand what they are doing.

> > There is one place in low latency communications that I can think
> > of where TCP/IP is not the proper solution. For low latency
> > communication the checksum is at the wrong end of the packet.
>
> That's one of the few things ATM's AAL5 got right. But in the end,
> I think it doesn't really matter. At 1 Gbps, an MTU-sized packet
> flies by within 13 us. At 10 Gbps, it's only 1.3 us. At that point,
> you may well treat it as an atomic unit.

So consider store-and-forward of packets through a 3-layer switch
hierarchy, at 1.3us per copy: 1.3us to the NIC + 1.3us to the first
switch chip + 1.3us to the second switch chip + 1.3us to the top-level
switch chip + 1.3us to a middle-layer switch chip + 1.3us to the
receiving NIC + 1.3us to the receiver.

1.3us * 7 = 9.1us to deliver a packet to the other side. That is
still quite painful. Right now I can get better latencies over any of
the cluster interconnects. I think 5 us is the current low end, with
the high end being about 1 us.
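
Spelling the arithmetic out (back-of-the-envelope only; the 1.3us
per-copy figure is roughly the serialization time of an MTU-sized
packet at 10 Gbps, rounded up):

/* Store-and-forward arithmetic, nothing more.  The numbers are
 * illustrative, not measurements.
 */
#include <stdio.h>

int main(void)
{
	double link_bps   = 10e9;		/* 10 Gbps link */
	double frame_bits = 1500.0 * 8.0;	/* MTU-sized packet */
	double per_hop_us = frame_bits / link_bps * 1e6;
	int    copies     = 7;			/* the 7 copies listed above */

	printf("%.2f us per copy, %.2f us end to end\n",
	       per_hop_us, per_hop_us * copies);	/* ~1.2us, ~8.4us */
	return 0;
}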

Quite often in MPI, when a message is sent the program cannot continue
until the reply is received. Possibly this is a fundamental problem
with the application programming model, encouraging applications to
be latency sensitive. But it is a well-established API and
programming paradigm, so it has to be lived with.
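
For the record, the pattern looks like this: a plain two-rank
ping-pong using only the standard MPI C API, run as "mpirun -np 2".
Rank 0 is stalled inside MPI_Recv for the full round trip, which is
why the interconnect latency translates directly into lost compute
time.

/* Two-rank MPI ping-pong: rank 0 cannot continue past MPI_Recv until
 * rank 1's reply has made the full round trip.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	int rank, token = 0;
	double t0, t1;

	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	if (rank == 0) {
		t0 = MPI_Wtime();
		MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
		MPI_Recv(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
			 MPI_STATUS_IGNORE);
		t1 = MPI_Wtime();
		printf("round trip: %.1f us\n", (t1 - t0) * 1e6);
	} else if (rank == 1) {
		MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
			 MPI_STATUS_IGNORE);
		MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
	}

	MPI_Finalize();
	return 0;
}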

All of this is pretty much the reverse of the TOE case. Things are
latency sensitive because real work needs to be done. And the more
latency you have the slower that work gets done.

A lot of the NICs which are used for MPI tend to be smart, for two
reasons: 1) so they can do source routing, and 2) so they can safely
export some of their interface to user space, so that in the fast path
they can bypass the kernel.

Eric
