Re: kernel thread support - LWP's

Larry McVoy (lm@bitmover.com)
Thu, 15 Jul 1999 17:07:23 -0600


: Larry McVoy wrote:
: > : well, context switches are painful as is any kernel crossing in high
: > : performance computing. imagine user level networking on high speed
: > : connections that can have round trip times in the ~50us range (this is
: > : a software implementation in our lab, SGI's GSN is committed to round
: > : trip times of around 7us roundtrip hardware latency), if you
: >
: > I've (a) spent a great deal of time thinking about this very issue, and
: > (b) worked on GSN at SGI, and (c) am under contract with LLNL working
: > on exactly this issue, amongst others. I'm pretty in tune with the
: > problem space and I don't see that it has any bearing on the discussion
: > at all. If you are going to context switch for each packet, you can
: > kiss your performance good bye whether you are context switching threads
: > or processes. Neither are fast enough to hit the needed 10 usec round
: > trip time that all the HPC folks like LLNL want.
:
: Hi Larry, there is someone in our group at CERN also working on user
: level threads. His measurements (benchmarks in L1 cache of course) are
: 0.05 microseconds for context switch in user space.

Interesting. I just coded up a little benchmark that shows 0.05 usecs
is 2x what a procedure call costs on a 400 MHz Celeron. Kinda makes
me wonder exactly what sort of "context" he is saving and restoring.
I kind of doubt he's saving/restoring everything, like floating point
registers, etc. But whatever.
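
For the curious, here's a minimal sketch of that kind of comparison,
with swapcontext() standing in for a user-level thread switch (this is
not the actual benchmark; gettimeofday() is shown only for portability,
you'd really read the cycle counter for numbers this small):

    /* Sketch: cost of a plain procedure call vs. a minimal user-level
     * context switch.  Note that swapcontext() also saves/restores the
     * signal mask with a system call, so a hand-rolled switch that
     * skips that (and maybe the FP state) would be much faster. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <ucontext.h>

    #define N 1000000

    static ucontext_t main_ctx, thr_ctx;
    static char thr_stack[64 * 1024];

    static void nop(void) { }
    static void (*volatile fp)(void) = nop;    /* defeat inlining */

    static void spinner(void)
    {
        for (;;)
            swapcontext(&thr_ctx, &main_ctx);  /* bounce straight back */
    }

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, 0);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        double t0, t1;
        int i;

        t0 = now();
        for (i = 0; i < N; i++)
            fp();                              /* procedure call */
        t1 = now();
        printf("call:   %.4f usec\n", (t1 - t0) / N * 1e6);

        getcontext(&thr_ctx);
        thr_ctx.uc_stack.ss_sp = thr_stack;
        thr_ctx.uc_stack.ss_size = sizeof(thr_stack);
        thr_ctx.uc_link = &main_ctx;
        makecontext(&thr_ctx, spinner, 0);

        t0 = now();
        for (i = 0; i < N; i++)
            swapcontext(&main_ctx, &thr_ctx);  /* 2 switches per pass */
        t1 = now();
        printf("switch: %.4f usec\n", (t1 - t0) / N / 2 * 1e6);
        return 0;
    }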

: Now you can say that a real app will swamp this with cache misses. But
: when it's within the cache, ~2-3 microseconds kernel vs. 0.05
: microseconds user is a pretty severe difference.

Really? I doubt it. I understand the need; I just think he's going about
it wrong. If what you want is low-latency packet transfers, the fastest
way is no context switch at all. The device should place the data in
memory, and you should be sitting there waiting for it.
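
Roughly like the sketch below, assuming a NIC that can DMA the payload
and a completion flag straight into pinned user memory (the slot layout
and names are invented for illustration):

    /* Receive path with no context switch at all: the NIC DMAs the
     * payload into user memory and sets `done` last; the consumer
     * just spins on the flag.  No syscall, no scheduler.  A weakly
     * ordered CPU would also need a read barrier after seeing
     * `done` before touching the data. */
    struct rx_slot {
        volatile unsigned int done;    /* written last by the NIC */
        unsigned int len;              /* payload length in bytes */
        char data[2048];
    };

    static struct rx_slot *wait_for_packet(struct rx_slot *slot)
    {
        while (!slot->done)
            ;                          /* burn the CPU, save the latency */
        return slot;                   /* data[0..len) is now valid */
    }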

: > Agreed with the first part, couldn't agree with the second part - it ain't
: > happening - the context switches will be kernel level context switches
: > whether they are "threads" or "processes" since the event generated
: > is a kernel level event.
:
: Now you're generalising... the system here responds to events entirely
: in user space.

Not really. The device generating the packets runs kernel code, does it
not?

: > Yeah, you can deliver the packet into user space directly, but have
: > fun getting the kernel to tell your user level scheduler to run a new
: > thread. Sure it can be done, and has been done, but an old quote of
: > mine is "Architect: someone who knows the difference between what
: > could be done and what should be done". My architect hat says this is
: > not "a should be done", your view may be different.
:
: I kinda agree that polling a device using code generated by a modified
: compiler does not look like the right way at first... but this model is the
: only one I know of where an Intel box can saturate a Gigabit Ethernet
: link in both directions at once, with 6% CPU load and consistently <50
: microseconds response latency (min. 25 microseconds).
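
As I understand that model, the modified compiler drops a poll of a
device status word into the generated code at loop back-edges, along
these lines (all names here are hypothetical, just to make the idea
concrete):

    /* Hypothetical illustration of compiler-inserted polling: every
     * loop back-edge gets a cheap check of a status word the NIC
     * updates by DMA; when data arrives, a user-level handler runs
     * without any kernel involvement. */
    static volatile unsigned int nic_status;   /* DMA'd by the NIC */

    static void handle_packet(void)
    {
        /* user-level protocol processing goes here */
    }

    #define POLL_DEVICE() \
        do { if (nic_status) handle_packet(); } while (0)

    /* What the compiler effectively emits for an ordinary loop: */
    long sum(long *a, int n)
    {
        long s = 0;
        int i;

        for (i = 0; i < n; i++) {
            s += a[i];
            POLL_DEVICE();             /* inserted at the back-edge */
        }
        return s;
    }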

If you are running TCP/IP, I'm very impressed. If not, I'm not. If you
are just blasting and receiving raw Ethernet packets, so what?

: I'm not advertising as it's not my work. Just observing that no other
: model is close to this performance as far as I know.

Again, if you are comparing apples to apples, you have a fantastic
point; I want to learn more, and I'll happily eat my words in public.
But if you are comparing TCP/IP performance with raw packet performance,
that's like comparing a Geo with a Ferrari. Not exactly meaningful.
