Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)

Ingo Molnar (mingo@chiara.csoma.elte.hu)
Sun, 2 May 1999 14:12:39 +0200 (CEST)



---1247997369-859512174-925647159=:21826
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Sun, 2 May 1999, Mark Russinovich wrote:

> >first he claims Linux has only select(), and then he continues to bash
> >select(). (without providing measurements or benchmark numbers) Then he
> >says that Linux _does_ have asynchronous IO events implemented in 2.2 but
> >says that they have 'two major limitations'. Both 'limitations' he
> >mentions are in fact a pure implementation matter and not a mechanism or
> >API limitation. Mark also forgot to mention that Linux asynchronous IO is
> >superior to NT because we do not have to poll the completion port for
> >events, we can have the IO event delivered _immediately_ to the target
> >thread (which is preempted by a signal if it's running). This gives more
> >flexibility in using asynchronous events. (i have pointed out this difference
> >to him in private discussions, he left this argument unanswered)
> >
>
> Completion ports in NT require no polling and no linear searching - that,
> and their integration with the scheduler, is their entire reason for
> existence. [...]

they require a thread to block on completion ports, or to poll the status
of the completion port. NT gives no way to asynchronously send completion
events to a _running_ thread.
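
A minimal sketch of what signal-driven delivery looks like from user space, assuming the classic SIGIO mechanism (O_ASYNC plus F_SETOWN) rather than any particular 2.2-era extension. The names arm_async_fd() and demo_sigio() are hypothetical helpers made up for this illustration; the demo uses an AF_UNIX socketpair only so that it is self-contained.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Counts SIGIO deliveries; only touched from the handler and the demo. */
static volatile sig_atomic_t io_events;

static void on_sigio(int signo)
{
    (void)signo;
    io_events++;            /* the running code is interrupted right here */
}

/* Ask the kernel to send SIGIO to this process when fd becomes ready.
 * Returns 0 on success, -1 on error. */
int arm_async_fd(int fd)
{
    struct sigaction sa;
    int flags;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;
    if (fcntl(fd, F_SETOWN, getpid()) < 0)      /* who receives the signal */
        return -1;
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Writes one byte to the peer and returns how many SIGIO events arrived
 * (or -1 on setup failure). No thread blocks waiting for the event. */
int demo_sigio(void)
{
    int sv[2], i, seen = -1;
    char c;
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    io_events = 0;
    if (arm_async_fd(sv[0]) == 0 && write(sv[1], "x", 1) == 1) {
        /* Bounded wait in case delivery is not immediate. */
        for (i = 0; i < 100 && io_events == 0; i++)
            usleep(1000);
        n = read(sv[0], &c, 1);
        (void)n;
        seen = (int)io_events;
    }
    close(sv[0]);
    close(sv[1]);
    return seen;
}
```

The point of the sketch is only the control flow: the handler runs in the context of whatever the process was doing, with no thread parked on a completion queue. A real server would do far less in the handler and would have to cope with signal coalescing.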

> [...] Also, Linux's implementation of asynchronous I/O only applies to
> tty devices and to *new connections* on sockets - nothing else. [...]

yes, networking is the main user of asynchronous events. Given that
asynchronous IO is rather new under Linux, it was a natural choice.

> Sure asynchronous I/O can be added to the rest of the I/O architecture

no. I personally think that networking is about the only place where this
technique has a long term future ... do you suggest that any 'enterprise
server' is IO-bound on block devices? But yes, it can be added. (squid for
one could benefit from it, but even squid is typically memory or disk seek
time limited)

> >here he again forgets to _prove_ that overscheduling happens in Linux.
> >Measurements have been done on big busy Linux webservers (much bigger than
> >the typical 'enterprise' category), and the runqueue length (number of
> >threads competing for requests) was 3-4 typically. Enuff said ...
> >
>
> Under high load environments even the short run-queue lengths you refer to
> are enough to degrade performance. And in the environments I'm talking
> about where there are several hundred requests being served concurrently,
> the run queue lengths for Linux are significantly higher with the
> implementation of a one-thread-to-one-client server model.

do you suggest Dejanews does not work? You are often taking architectural
examples from the NT side, without measuring the Linux side. I actually
have a test-setup here that does 2000 new Apache connections a second
(over a real network), and no, we do not 'overschedule'.

It's often apples to oranges, and i'd really suggest that before you bash
any architectural solution (in _any_ OS) as a 'severe limitation' you had
better be damn sure you are right, or wear asbestos. I hope i'm not
sounding arrogant: _if_ we get into an overscheduling situation on the
networking side we already have plans to address it (with a few lines of
change), but currently it's not necessary. I have seen no request for
discussion from you on linux-kernel about overscheduling.

> >'kernel reentrancy'
> >-------------------
> >
> >his example is a clear red herring. If any Linux application is
> >read()/write() intensive to the page cache, it had better use mmap(). I
> >can understand Mark did not mention mmap(), NT has a rather inferior
> >mmap() implementation. (eg. read()/write() and mmap()-ed modifications
> >done to the same file are not guaranteed to be data-coherent by NT ...)
> >His threading point is correct, there is still code left to be threaded
> >for SMP operation. Just as NT has one single big lock in its networking
> >stack in NT4 SP4. (only SP5 has fixed this, which is not yet out of the
> >beta status.)
>
> First, serialization of long paths through the kernel degrades
> multiprocessor scalability - this is multiprocessing 101.

yes, sure. Do they make an OS 'unable to handle the enterprise category'?
Nope. Just like NT's deficiencies do not necessarily make it incapable. As
i've explained to you, much of the Linux IO path (the interrupt part) goes
under a different lock.

> You mention mmap, and I'm assuming you do so as an alternative to sendfile.

not at all. You mentioned cached read()/write(), and i just pointed out
that if you do heavy cached read()s and write()s then you do the wrong
thing. I've attached a patch from David S. Miller that deserializes much
of the 'heavy' parts of reads and writes in the ext2, pipe, TCP and
AF_UNIX path. The patch adds 50 new lines. (just so that people get a picture
of the magnitude of these 'severe limitations') But yes, Linux still
has a way to go.
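
As far as the attached patch can be read, its recurring move is to wrap the bulk copy in unlock_kernel()/lock_kernel() so that the copy itself no longer runs under the global lock. A user-space sketch of that shape follows, under loud assumptions: big_lock, lock_big(), unlock_big() and write_chunk_locked() are hypothetical stand-ins invented here, and a C11 atomic_flag spinlock plays the role of the kernel lock. It is not the kernel code.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a single global lock. */
static atomic_flag big_lock = ATOMIC_FLAG_INIT;

/* Shared bookkeeping that really does need the lock. */
static size_t bytes_written;

static void lock_big(void)
{
    while (atomic_flag_test_and_set_explicit(&big_lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void unlock_big(void)
{
    atomic_flag_clear_explicit(&big_lock, memory_order_release);
}

/* Called with big_lock held, returns with big_lock held (the same
 * convention as a code path entered under the global lock).
 * Returns the running total of bytes written. */
size_t write_chunk_locked(char *dst, const char *src, size_t n)
{
    unlock_big();           /* drop the lock for the expensive part    */
    memcpy(dst, src, n);    /* other CPUs can take the lock meanwhile  */
    lock_big();             /* re-take it before touching shared state */

    bytes_written += n;
    return bytes_written;
}
```

Dropping the lock is only safe if the destination cannot go away or be reused during the copy. The filemap hunk of the patch appears to handle that by setting PG_locked on the page before unlock_kernel() and clearing it afterwards; the sketch simply assumes dst is private to the caller.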

> BTW This isn't related to read-only file serving, but Linus admits that
> mmap in 2.2 has a flaw where write-backs to a modified file result in two
> copies instead of 1. He says that this will probably be fixed in 2.3.x.

yes this is a known problem. (_this_ is what i consider to be one of the
top Linux problems, not the other ones you mention.)
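
For readers who have not used the path under discussion, here is a small sketch of modifying a file through a shared mapping and then reading it back with an ordinary read-style call. poke_via_mmap() is a hypothetical helper written for this illustration; it assumes the file exists, is writable, and is at least one byte long. It says nothing about how many copies the kernel makes internally, which is the actual 2.2 issue.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Overwrite the first byte of `path` through a MAP_SHARED mapping, push it
 * out with msync(), then read the byte back with pread().
 * Returns the byte read back (0..255), or -1 on any error.
 * Assumption: the file is non-empty, otherwise touching the mapping faults. */
int poke_via_mmap(const char *path, char value)
{
    char back = 0;
    char *map;
    int rc;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;

    map = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    map[0] = value;                   /* plain store, no write() call  */
    rc = msync(map, 1, MS_SYNC);      /* wait for write-back to finish */
    munmap(map, 1);

    if (rc < 0 || pread(fd, &back, 1, 0) != 1) {
        close(fd);
        return -1;
    }
    close(fd);
    return (unsigned char)back;
}
```

If the mapped store and the pread() disagreed, that would be the kind of incoherence between mmap() and read()/write() mentioned earlier in the thread.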

> This implementation has 0 buffer copies and requires 1 system call to send
> an entire HTTP response. There is no manipulation of process address space,
> and the server need not manage its own file cache. In addition, the call
> can be made asynchronously, where waiting is done on a completion port that
> is waiting on new connections and more requests on existing connections.
> The asynchronous I/O model in NT extends to all I/O. NT (and Solaris,
> HP/UX, AIX) also have another API that Linux doesn't have yet: acceptex
> (the name of the NT version). This API is used to simultaneously perform an
> accept, the first read(), and getpeername() in one system call. The advantages
> should be obvious.

_please_, could you time NT, Solaris and HP/UX, how much they take for a
single sendfile() system call, and compare it to Linux null syscall
latencies? The Linux numbers are:

[mingo@moon l]$ ./lat_syscall null
Simple syscall: 0.8403 microseconds

one reason we made syscalls so lightweight is to avoid silly
'multi-purpose' conglomerate system calls like NT has. sendfile() was
added not mainly to avoid system calls, but because of its strong (and
unique) conceptual foundations. Linux syscalls will be even more
lightweight in the future. (i have a prototype patch that makes them cost
0.30 microseconds) Do you see the point? Again, an apples to oranges
problem.
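
A rough way to reproduce a null-syscall figure without lmbench is sketched below. It is not the lat_syscall source: null_syscall_usec() is a hypothetical helper, and the choice of getppid (issued through syscall() so no library-side shortcut can intervene) as the "do nothing" call is an assumption made for this illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average cost in microseconds of one trivial system call, measured over
 * `iters` back-to-back calls. Returns a negative value if iters <= 0.
 * Loop overhead and timer overhead are included, so treat the result as
 * an upper bound rather than a precise latency. */
double null_syscall_usec(long iters)
{
    struct timespec t0, t1;
    double ns;
    long i;

    if (iters <= 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++)
        syscall(SYS_getppid);          /* enter and leave the kernel */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
       + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / 1e3 / (double)iters;
}
```

Calling it with something like 1000000 iterations smooths out timer granularity. Numbers from different machines, or from different operating systems using a different "null" call, are not directly comparable.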

> As for the Linux implementation of sendfile(), it does not support adding a
> header and the Linux TCP stack does not support 0-copy sends. Thus, there
> is an extra system call and buffer copy for a write() to send the header,
> and an extra buffer copy for sending the file.
[...]

> Just to clarify, the Linux TCP/IP stack does not support 0-copy sending.

zero-copy has drawbacks too. (latency ones mainly) you seem to be very
much focused on bandwidth, but that's not everything. Could you please
compare the latencies of the Linux and NT TCP stack? (i have) Or do you
believe that latency does not matter? But yes in certain circumstances we
want to have zero-copy. (sendfile is one such example)
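
The "extra system call" being argued about is easiest to see in code. The sketch below sends a header with write() and then the file body with Linux sendfile(); send_header_and_file() is a hypothetical helper name, and error handling is reduced to returning -1. It uses today's sendfile() signature from <sys/sendfile.h>, which may differ in detail from the 2.2 interface.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Send `hdr` followed by the full contents of `path` on out_fd.
 * Returns the total number of bytes sent, or -1 on error. */
ssize_t send_header_and_file(int out_fd, const char *hdr, const char *path)
{
    size_t hlen = strlen(hdr);
    struct stat st;
    off_t off = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    /* System call 1: the header is copied from user space. */
    if (write(out_fd, hdr, hlen) != (ssize_t)hlen) {
        close(fd);
        return -1;
    }

    /* System call 2 (possibly repeated): the kernel moves file data to
     * out_fd itself; the data never enters this process's buffers. */
    while (off < st.st_size) {
        ssize_t n = sendfile(out_fd, fd, &off, (size_t)(st.st_size - off));
        if (n <= 0) {
            close(fd);
            return -1;
        }
    }

    close(fd);
    return (ssize_t)hlen + (ssize_t)st.st_size;
}
```

With a sub-microsecond syscall, the separate write() for the header costs very little next to the transfer itself, which is the argument made above about conglomerate calls.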

> >in private discussions with Mark i have pointed out most of these
> >counter-arguments, which he unfortunately failed to answer. He also didn't
> >answer my questions about NT's shortcomings in the above areas. (as
> >always, seemingly powerful concepts can often open up ugly ratholes)
> >Different OS, different approach. Let the numbers talk.
>
> I try to answer all e-mail that raises technical issues. If I failed to
> answer yours, Ingo, then it was simply because I was too busy.

my major problem with your analysis is that in my opinion you paint a
one-sided picture, NT always on the 'winner' side, and Linux on the
'loser' side. Am i correct to understand that you consider Linux to be an
inferior design? I think there are two more technical issues you left
unanswered previously:

- CPU-specific optimizations. NT offers one single binary image for all
x86 CPU architectures. (barring the SMP/UP distinction) How do you explain
the speed penalty to your 'enterprise customers'? The same holds for
CPU-specific assembly optimizations.

- NT's 'hidden locks'. Just as NT4 SP5 beta introduced 'deserialization'
silently into the networking code. (and certainly they claimed NT to be in
the 'enterprise category' years before) Are you 100% sure there are no
other NT subsystems left out 'accidentally' that make it incapable of
handling the load of 'enterprise class servers'? How can you be sure that
NT's TCP timers are scalable? You do not seem to _honor_ and balance the
fact that Linux has all its source code out there, and thus yes all the
mistakes are visible. NT is basically a black box. You quote manuals from
NT instead of source code. Then you compare that to Linux without doing
head-to-head measurements.

-- mingo

---1247997369-859512174-925647159=:21826
Content-Type: TEXT/PLAIN; charset=US-ASCII; name="davem-smpthreading-2.2.7-A0"
Content-Transfer-Encoding: BASE64
Content-ID: <Pine.LNX.3.96.990502141239.21826E@chiara.csoma.elte.hu>
Content-Description:

LS0tIGxpbnV4L2ZzL2V4dDIvZmlsZS5jLm9yaWcJVHVlIERlYyAyOSAxNToz
NzowMSAxOTk4DQorKysgbGludXgvZnMvZXh0Mi9maWxlLmMJU3VuIE1heSAg
MiAxMjoyNDo0MSAxOTk5DQpAQCAtMzAsNiArMzAsNyBAQA0KICNpbmNsdWRl
IDxsaW51eC9sb2Nrcy5oPg0KICNpbmNsdWRlIDxsaW51eC9tbS5oPg0KICNp
bmNsdWRlIDxsaW51eC9wYWdlbWFwLmg+DQorI2luY2x1ZGUgPGxpbnV4L3Nt
cF9sb2NrLmg+DQogDQogI2RlZmluZQlOQlVGCTMyDQogDQpAQCAtMjU3LDcg
KzI1OCw5IEBADQogCQkJCWJyZWFrOw0KIAkJCX0NCiAJCX0NCisJCXVubG9j
a19rZXJuZWwoKTsNCiAJCWMgLT0gY29weV9mcm9tX3VzZXIgKGJoLT5iX2Rh
dGEgKyBvZmZzZXQsIGJ1ZiwgYyk7DQorCQlsb2NrX2tlcm5lbCgpOw0KIAkJ
aWYgKCFjKSB7DQogCQkJYnJlbHNlKGJoKTsNCiAJCQlpZiAoIXdyaXR0ZW4p
DQotLS0gbGludXgvZnMvcGlwZS5jLm9yaWcJVHVlIE5vdiAyNCAxMTo0MToy
OCAxOTk4DQorKysgbGludXgvZnMvcGlwZS5jCVN1biBNYXkgIDIgMTI6MjQ6
NDEgMTk5OQ0KQEAgLTcsNiArNyw3IEBADQogI2luY2x1ZGUgPGxpbnV4L21t
Lmg+DQogI2luY2x1ZGUgPGxpbnV4L2ZpbGUuaD4NCiAjaW5jbHVkZSA8bGlu
dXgvcG9sbC5oPg0KKyNpbmNsdWRlIDxsaW51eC9zbXBfbG9jay5oPg0KIA0K
ICNpbmNsdWRlIDxhc20vdWFjY2Vzcy5oPg0KIA0KQEAgLTY4LDcgKzY5LDkg
QEANCiAJCVBJUEVfU1RBUlQoKmlub2RlKSAmPSAoUElQRV9CVUYtMSk7DQog
CQlQSVBFX0xFTigqaW5vZGUpIC09IGNoYXJzOw0KIAkJY291bnQgLT0gY2hh
cnM7DQorCQl1bmxvY2tfa2VybmVsKCk7DQogCQljb3B5X3RvX3VzZXIoYnVm
LCBwaXBlYnVmLCBjaGFycyApOw0KKwkJbG9ja19rZXJuZWwoKTsNCiAJCWJ1
ZiArPSBjaGFyczsNCiAJfQ0KIAlQSVBFX0xPQ0soKmlub2RlKS0tOw0KQEAg
LTEzNCw3ICsxMzcsOSBAQA0KIAkJCXdyaXR0ZW4gKz0gY2hhcnM7DQogCQkJ
UElQRV9MRU4oKmlub2RlKSArPSBjaGFyczsNCiAJCQljb3VudCAtPSBjaGFy
czsNCisJCQl1bmxvY2tfa2VybmVsKCk7DQogCQkJY29weV9mcm9tX3VzZXIo
cGlwZWJ1ZiwgYnVmLCBjaGFycyApOw0KKwkJCWxvY2tfa2VybmVsKCk7DQog
CQkJYnVmICs9IGNoYXJzOw0KIAkJfQ0KIAkJUElQRV9MT0NLKCppbm9kZSkt
LTsNCi0tLSBsaW51eC9tbS9maWxlbWFwLmMub3JpZwlUdWUgQXByICA2IDE0
OjM1OjA5IDE5OTkNCisrKyBsaW51eC9tbS9maWxlbWFwLmMJU3VuIE1heSAg
MiAxMjoyNDo0MSAxOTk5DQpAQCAtMjM0LDEwICsyMzQsMjEgQEANCiANCiAJ
CWlmIChsZW4gPiBjb3VudCkNCiAJCQlsZW4gPSBjb3VudDsNCisJcmV0cnk6
DQogCQlwYWdlID0gZmluZF9wYWdlKGlub2RlLCBwb3MpOw0KIAkJaWYgKHBh
Z2UpIHsNCi0JCQl3YWl0X29uX3BhZ2UocGFnZSk7DQorCQkJaWYgKFBhZ2VM
b2NrZWQocGFnZSkpIHsNCisJCQkJd2FpdF9vbl9wYWdlKHBhZ2UpOw0KKwkJ
CQlyZWxlYXNlX3BhZ2UocGFnZSk7DQorCQkJCWdvdG8gcmV0cnk7DQorCQkJ
fQ0KKwkJCXNldF9iaXQoUEdfbG9ja2VkLCAmcGFnZS0+ZmxhZ3MpOw0KKwkJ
CXVubG9ja19rZXJuZWwoKTsNCiAJCQltZW1jcHkoKHZvaWQgKikgKG9mZnNl
dCArIHBhZ2VfYWRkcmVzcyhwYWdlKSksIGJ1ZiwgbGVuKTsNCisJCQlsb2Nr
X2tlcm5lbCgpOw0KKwkJCWNsZWFyX2JpdChQR19sb2NrZWQsICZwYWdlLT5m
bGFncyk7DQorCQkJaWYgKHdhaXRxdWV1ZV9hY3RpdmUoJnBhZ2UtPndhaXQp
KQ0KKwkJCQl3YWtlX3VwKCZwYWdlLT53YWl0KTsNCiAJCQlyZWxlYXNlX3Bh
Z2UocGFnZSk7DQogCQl9DQogCQljb3VudCAtPSBsZW47DQpAQCAtNzc1LDcg
Kzc4Niw5IEBADQogDQogCWlmIChzaXplID4gY291bnQpDQogCQlzaXplID0g
Y291bnQ7DQorCXVubG9ja19rZXJuZWwoKTsNCiAJbGVmdCA9IF9fY29weV90
b191c2VyKGRlc2MtPmJ1ZiwgYXJlYSwgc2l6ZSk7DQorCWxvY2tfa2VybmVs
KCk7DQogCWlmIChsZWZ0KSB7DQogCQlzaXplIC09IGxlZnQ7DQogCQlkZXNj
LT5lcnJvciA9IC1FRkFVTFQ7DQotLS0gbGludXgvbmV0L3VuaXgvYWZfdW5p
eC5jLm9yaWcJVHVlIEFwciAgNiAxNDozNToxMCAxOTk5DQorKysgbGludXgv
bmV0L3VuaXgvYWZfdW5peC5jCVN1biBNYXkgIDIgMTI6MjQ6NDEgMTk5OQ0K
QEAgLTEwMyw2ICsxMDMsNyBAQA0KICNpbmNsdWRlIDxuZXQvc2NtLmg+DQog
I2luY2x1ZGUgPGxpbnV4L2luaXQuaD4NCiAjaW5jbHVkZSA8bGludXgvcG9s
bC5oPg0KKyNpbmNsdWRlIDxsaW51eC9zbXBfbG9jay5oPg0KIA0KICNpbmNs
dWRlIDxhc20vY2hlY2tzdW0uaD4NCiANCkBAIC05ODEsNyArOTgyLDExIEBA
DQogCQl1bml4X2F0dGFjaF9mZHMoc2NtLCBza2IpOw0KIA0KIAlza2ItPmgu
cmF3ID0gc2tiLT5kYXRhOw0KKw0KKwl1bmxvY2tfa2VybmVsKCk7DQogCWVy
ciA9IG1lbWNweV9mcm9taW92ZWMoc2tiX3B1dChza2IsbGVuKSwgbXNnLT5t
c2dfaW92LCBsZW4pOw0KKwlsb2NrX2tlcm5lbCgpOw0KKw0KIAlpZiAoZXJy
KQ0KIAkJZ290byBvdXRfZnJlZTsNCiANCkBAIC0xMTM3LDExICsxMTQyLDE5
IEBADQogCQlpZiAoc2NtLT5mcCkNCiAJCQl1bml4X2F0dGFjaF9mZHMoc2Nt
LCBza2IpOw0KIA0KLQkJaWYgKG1lbWNweV9mcm9taW92ZWMoc2tiX3B1dChz
a2Isc2l6ZSksIG1zZy0+bXNnX2lvdiwgc2l6ZSkpIHsNCi0JCQlrZnJlZV9z
a2Ioc2tiKTsNCi0JCQlpZiAoc2VudCkNCi0JCQkJZ290byBvdXQ7DQotCQkJ
cmV0dXJuIC1FRkFVTFQ7DQorCQl7DQorCQkJaW50IGVycjsNCisNCisJCQl1
bmxvY2tfa2VybmVsKCk7DQorCQkJZXJyID0gbWVtY3B5X2Zyb21pb3ZlYyhz
a2JfcHV0KHNrYixzaXplKSwgbXNnLT5tc2dfaW92LCBzaXplKTsNCisJCQls
b2NrX2tlcm5lbCgpOw0KKw0KKwkJCWlmKGVycikgew0KKwkJCQlrZnJlZV9z
a2Ioc2tiKTsNCisJCQkJaWYgKHNlbnQpDQorCQkJCQlnb3RvIG91dDsNCisJ
CQkJcmV0dXJuIC1FRkFVTFQ7DQorCQkJfQ0KIAkJfQ0KIA0KIAkJb3RoZXI9
dW5peF9wZWVyKHNrKTsNCkBAIC0xMjE5LDcgKzEyMzIsMTAgQEANCiAJZWxz
ZSBpZiAoc2l6ZSA8IHNrYi0+bGVuKQ0KIAkJbXNnLT5tc2dfZmxhZ3MgfD0g
TVNHX1RSVU5DOw0KIA0KKwl1bmxvY2tfa2VybmVsKCk7DQogCWVyciA9IHNr
Yl9jb3B5X2RhdGFncmFtX2lvdmVjKHNrYiwgMCwgbXNnLT5tc2dfaW92LCBz
aXplKTsNCisJbG9ja19rZXJuZWwoKTsNCisNCiAJaWYgKGVycikNCiAJCWdv
dG8gb3V0X2ZyZWU7DQogDQpAQCAtMTMzOSwxMSArMTM1NSwxOSBAQA0KIAkJ
fQ0KIA0KIAkJY2h1bmsgPSBtaW4oc2tiLT5sZW4sIHNpemUpOw0KLQkJaWYg
KG1lbWNweV90b2lvdmVjKG1zZy0+bXNnX2lvdiwgc2tiLT5kYXRhLCBjaHVu
aykpIHsNCi0JCQlza2JfcXVldWVfaGVhZCgmc2stPnJlY2VpdmVfcXVldWUs
IHNrYik7DQotCQkJaWYgKGNvcGllZCA9PSAwKQ0KLQkJCQljb3BpZWQgPSAt
RUZBVUxUOw0KLQkJCWJyZWFrOw0KKwkJew0KKwkJCWludCBlcnI7DQorDQor
CQkJdW5sb2NrX2tlcm5lbCgpOw0KKwkJCWVyciA9IG1lbWNweV90b2lvdmVj
KG1zZy0+bXNnX2lvdiwgc2tiLT5kYXRhLCBjaHVuayk7DQorCQkJbG9ja19r
ZXJuZWwoKTsNCisNCisJCQlpZihlcnIpIHsNCisJCQkJc2tiX3F1ZXVlX2hl
YWQoJnNrLT5yZWNlaXZlX3F1ZXVlLCBza2IpOw0KKwkJCQlpZiAoY29waWVk
ID09IDApDQorCQkJCQljb3BpZWQgPSAtRUZBVUxUOw0KKwkJCQlicmVhazsN
CisJCQl9DQogCQl9DQogCQljb3BpZWQgKz0gY2h1bms7DQogCQlzaXplIC09
IGNodW5rOw0KLS0tIGxpbnV4L25ldC9pcHY0L3RjcC5jLm9yaWcJU3VuIE1h
eSAgMiAxMjoyMzoyNCAxOTk5DQorKysgbGludXgvbmV0L2lwdjQvdGNwLmMJ
U3VuIE1heSAgMiAxMjoyNDo0MSAxOTk5DQpAQCAtNDE1LDYgKzQxNSw3IEBA
DQogI2luY2x1ZGUgPGxpbnV4L3R5cGVzLmg+DQogI2luY2x1ZGUgPGxpbnV4
L2ZjbnRsLmg+DQogI2luY2x1ZGUgPGxpbnV4L3BvbGwuaD4NCisjaW5jbHVk
ZSA8bGludXgvc21wX2xvY2suaD4NCiAjaW5jbHVkZSA8bGludXgvaW5pdC5o
Pg0KIA0KICNpbmNsdWRlIDxuZXQvaWNtcC5oPg0KQEAgLTkwNSw2ICs5MDYs
OCBAQA0KIAkJCQljb250aW51ZTsNCiAJCQl9DQogDQorCQkJdW5sb2NrX2tl
cm5lbCgpOw0KKw0KIAkJCXNlZ2xlbiAtPSBjb3B5Ow0KIA0KIAkJCS8qIFBy
ZXBhcmUgY29udHJvbCBiaXRzIGZvciBUQ1AgaGVhZGVyIGNyZWF0aW9uIGVu
Z2luZS4gKi8NCkBAIC05MjMsOCArOTI2LDEwIEBADQogCQkJICogUmVzZXJ2
ZSBoZWFkZXIgc3BhY2UgYW5kIGNoZWNrc3VtIHRoZSBkYXRhLg0KIAkJCSAq
Lw0KIAkJCXNrYl9yZXNlcnZlKHNrYiwgTUFYX0hFQURFUiArIHNrLT5wcm90
LT5tYXhfaGVhZGVyKTsNCisNCiAJCQlza2ItPmNzdW0gPSBjc3VtX2FuZF9j
b3B5X2Zyb21fdXNlcihmcm9tLA0KIAkJCQkJc2tiX3B1dChza2IsIGNvcHkp
LCBjb3B5LCAwLCAmZXJyKTsNCisJCQlsb2NrX2tlcm5lbCgpOw0KIA0KIAkJ
CWlmIChlcnIpDQogCQkJCWdvdG8gZG9fZmF1bHQ7DQpAQCAtMTI4Niw3ICsx
MjkxLDExIEBADQogCQkgKglkbyBhIHNlY29uZCByZWFkIGl0IHJlbGllcyBv
biB0aGUgc2tiLT51c2VycyB0byBhdm9pZA0KIAkJICoJYSBjcmFzaCB3aGVu
IGNsZWFudXBfcmJ1ZigpIGdldHMgY2FsbGVkLg0KIAkJICovDQorDQorCQl1
bmxvY2tfa2VybmVsKCk7DQogCQllcnIgPSBtZW1jcHlfdG9pb3ZlYyhtc2ct
Pm1zZ19pb3YsICgodW5zaWduZWQgY2hhciAqKXNrYi0+aC50aCkgKyBza2It
PmgudGgtPmRvZmYqNCArIG9mZnNldCwgdXNlZCk7DQorCQlsb2NrX2tlcm5l
bCgpOw0KKw0KIAkJaWYgKGVycikgew0KIAkJCS8qIEV4Y2VwdGlvbi4gQmFp
bG91dCEgKi8NCiAJCQlhdG9taWNfZGVjKCZza2ItPnVzZXJzKTsNCg==
---1247997369-859512174-925647159=:21826--
