Re: Auto-Adaptive scheduler - Final chapter ( the numbers ) ...

From: Marco Colombo (marco@esi.it)
Date: Mon Jan 31 2000 - 12:36:14 EST


On Mon, 31 Jan 2000, Jamie Lokier wrote:

> Marco Colombo wrote:
> > The question "How many switch / sec will do a system with RQ << 1?"
> > makes no sense since there's no fixed relationship between the two
> > numbers.
>
> That's why the question is asked: what do we actually see when we
> measure different applications?
>
> _If_ we see that low RQ applications don't switch much, then we can
> assert that low RQ applications aren't adversely affected by the patch.

Agreed.

> > [quite unrelated]
> > "never switches" - I'm not sure of this. I'd like to see real-world
> > figures. Even the single process may sleep for net I/O, I think.
> > So you get some switches. Anyway, that's not the point here.
>
> You don't get switches because an assumption is "when the server is
> busy". We're also assuming enough memory for queues etc. and that the
> I/O subsystem isn't blocking anything. (A network proxy is the only
> example I can think of).
>
> In that case, there is always work to do, and the application always
> does it within its single thread. If some network I/O is blocked, its bit
> isn't set in select() and some other network I/O is done instead.
>
> So the process never sleeps, and of course never schedules to another
> process. (It's single threaded).

Agreed.
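(For what it's worth, that's the classic single-threaded select() loop.
Here is a minimal sketch, all of it made up just for illustration -- no
error handling, arbitrary port, echo-only "service". The point is that as
long as some fd is always ready, select() returns at once, the process
never sleeps and never gets scheduled away, so the RQ stays at 1.)

#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa;
        fd_set interest;
        int maxfd;

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = htons(8000);              /* arbitrary port */
        bind(lfd, (struct sockaddr *) &sa, sizeof(sa));
        listen(lfd, 16);

        FD_ZERO(&interest);
        FD_SET(lfd, &interest);
        maxfd = lfd;

        for (;;) {
                fd_set ready = interest;
                int fd;

                /* the only place we can sleep: nothing ready at all */
                select(maxfd + 1, &ready, NULL, NULL, NULL);

                for (fd = 0; fd <= maxfd; fd++) {
                        if (!FD_ISSET(fd, &ready))
                                continue;
                        if (fd == lfd) {
                                /* new connection, add it to the set */
                                int c = accept(lfd, NULL, NULL);
                                fcntl(c, F_SETFL, O_NONBLOCK);
                                FD_SET(c, &interest);
                                if (c > maxfd)
                                        maxfd = c;
                        } else {
                                /* ready client: a bit of non-blocking work */
                                char buf[4096];
                                int n = read(fd, buf, sizeof(buf));

                                if (n <= 0) {
                                        FD_CLR(fd, &interest);
                                        close(fd);
                                } else {
                                        write(fd, buf, n);      /* just echo */
                                }
                        }
                }
        }
}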

> ( If it does sleep for some reason we assume the time spent scheduling
> is then absorbed into the idle period, though it may affect latency. )
>
> > > Therefore in your example, RQ and switching rate are very closely related.
> > They are (in my example).
>
> Good. The rest of the article is written around your example, that is
> why high RQ is considered synonymous with high switching rate :-)

It is not. My message was not written around the example... and...

> > The example was made to show that "RQ" is not related to "load". Under
> > the same load, you can have very different RQ, depending (in my
> > example) on application design. Also it shows that "switch rate" is
> > not related to "load". Again, in my example, it's related to
> > application design instead.
>
> The article is written around the assumption of constant, i.e., maximum
> load. The relationship between "load" and any other parameter is
> therefore not relevant.

... and there's no such assumption anywhere. Even the example does not
assume that: it just happens to show it, because its scope was to show
something else... but now I see the misunderstanding. B-)

> > The example was not made to show that RQ is not related to switch rate.
> > I can provide a easy one, but i don't think it's necessary, since I'm sure
> > you know there's no such a fixed relationship.
>
> The question is not "is there a fixed relationship", but "what
> relationship do we see?" That depends on the applications that people
> run of course. So it's roughly equivalent to "what applications do
> people run?".

Exactly. So there's no general relationship.
Different applications show different behaviours. Given a certain
application you may see a relationship between switch rate and run queue
length, but that's a property of that application, not a general one.
(E.g. N CPU-bound processes give a long RQ with almost no switching,
while a select() server handling lots of small requests keeps the RQ
around 1 and still switches on every sleep/wakeup.)

>
> > > Real servers are not so simple of course. But let us stick with the
> > > simple one.
> >
> > Yes. Counter-examples may be simple and silly... B-)
> > Positive examples cannot.
>
> Oh dear :-)
>
> > > > So please stop stating that:
> > > >
> > > > long RQ <=> high loads (1)
> > >
> > > Your own example just demonstrated that long RQ <=> high switching rate
> >
> > Oh, well, Jamie... below you teach me Logic. I'm pretty sure you know
> > that an example does not (positively) "demonstrate" anything.
> > You can use examples only to demonstrate something is *false*.
>
> Quite. I used your example to demonstrate that the assertion "the
> scheduler should not be patched because the patch adversely affects
> single-threaded servers" is false :-)

I never said that. The single-threaded server example just demonstrated
that you can have high loads without a long RQ. Just that.

> > Why? You're assuming (like Davide) that small RQ means small switch rate...
>
> That's because I'm assuming the simple server example like I said I was
> earlier in the message... And in that example, small RQ <=> small
> switch rate.

OK. Now it's clear. I was talking about the more general case.

> > Again! Every time I say "small RQ" you translate into "low switch rate"!!
> >
> > [ "the scheduler isn't entered at all" - what do you mean?
> > I take it for "low switch rate". But I've written "small RQ".
> > Why do you keep mixing two unrelated things? ]
>
> Because when considering only the two extreme implementations of a
> simple server, "low switch rate" and "small RQ" _are_ related.
>
> The article is a bit mixed up on this point -- I said I'm assuming the
> simple server example, but it's easy to read as if I'm talking generalities.

That was not clear in my original article. At that point I was really
talking about the more general case.

> > Yes, the scheduler *is* entered a lot of times under (certain kinds of)
> > heavy load, and that's a NORMAL condition. High switch rates MAY be
> > caused by real applications, even optimized ones. That's why the scheduler
> > should be (and is) optimized to be fast. Please note that under this
> > (common, sensible, right, desirable, what else?) condition Davide's
> > patch fails to be effective because the RQ is small.
>
> Right, but can we have an example of such a "real application" please?

Good point. I can't give you one; I have no figures.
I suspect (quite a wild guess) that a loaded NFS server would show it,
but I don't have access to such a system.

BTW, I was wrong. A high switch rate is not really "desirable". Every
application should keep it low. But sometimes you trade it off for low
latency, for example. So even if it's not "desirable" in itself, it may
buy you something you want (low latency), at the price of not being
optimal from the scheduler's point of view.
So I imagine that well-written applications which generate quite a lot
of switches do exist.
But I can't think of what you might want a long RQ for. A long RQ gives
you nothing.

> > Also the select() server may switch sometimes...
> > but yes, I *agree* that in real world cases you can have high switch rates...
>
> I agree too, but I'd like an example.

I hope someone else will fill the gap with numbers... I can't.

> > > Now now, that isn't true if all the processes are scheduling around the
> > > same memory image. At that point we might even find that it's kernel
> > > stack cache conflicts that are most significant, in which case we might
> > > want to colour the stack entry position a little...
> >
> > OK. The whole matter is for real world cases. I thought it was clear...
> > Do you think the case you illustrated is common in real world
> > applications?
>
> Consider the canonical static HTTP server which has most of its files
> mmapped. You receive a URL, look up the file in your hash table, and
> write() the file. Really, very little happens in the application in
> that case. The code is shared between all threads, including the kernel
> TCP/IP part. Very little data is copied -- no bulk data, once zero-copy
> TCP sending is implemented. And the hash lookup is optimised for
> minimal cache impact.
>
> Of course you could write that with select(), but then your write()
> calls will block the server for the few pages that aren't in memory.
> Better to use a few threads. I think that could schedule through a few
> threads without too much cache churn.

Yes. To get it *loaded*, you need tens of (absurdly) loaded clients and
a fast eth connection. Or 100 quite loaded clients. So, unless you're
running some kind of benchmark, you've got 100 humans repeatedly hitting
'Refresh' on their browsers, reading just static pages (you see,
non-static pages defeat the cache argument a bit). Hardly a "real" case...
And if you have 10000+ real clients, well, you can afford a second PC, I
think.
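(Just to be sure we're picturing the same server: I read your description
as each thread doing roughly the loop below. Only a sketch, and the helper
names are mine, invented for illustration -- the point being that per
request it's a hash lookup plus one write() of an mmap()ed file, so almost
no application data goes through the CPU caches.)

#include <stddef.h>
#include <unistd.h>

struct mapped_file {
        const void *addr;               /* mmap()ed file contents */
        size_t len;
};

/* Hypothetical helpers, names made up for the sketch:
 * next_request() reads the next URL and returns the client fd,
 * lookup_file() is the in-memory hash lookup. */
extern int next_request(char *url, size_t urlmax);
extern struct mapped_file *lookup_file(const char *url);

void worker_thread(void)
{
        char url[1024];

        for (;;) {
                int fd = next_request(url, sizeof(url));
                struct mapped_file *f = lookup_file(url);

                if (f)
                        write(fd, f->addr, f->len);     /* blocks only on pages not in core */
                close(fd);
        }
}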

> > And BTW, if "the processes are scheduling around the same memory image",
> > i think you are referring to the i-cache. They are doing the same
> > "computation". Are they also processing the same data? If yes, can you
> > even imagine a real, useful application that comes close to that behavior
> > (all the CPUs doing the same thing on the same data)?
>
> Sending files via zero-copy DMA to the network card. The bulk data
> is never seen by the CPU.

Again, you assume many, fast, static requests. It's possible. Not very
common in RL...

> > And if not, it seems clear to me that if you have N threads doing the
> > same job, it's useless to have N bigger than the number of the CPUs,
> > in the general case, unless they sleep for I/O every now and
> > then. So the RQ is small anyway, unless we're on a > 16 CPUs SMP...
>
> Indeed, the optimal number of threads would seem to be where there tends
> to be 1 per CPU plus "a bit" where "a bit" is enough to cover the slack
> due to blocking. And as blocking removes threads from the run queue,
> that keeps the run queue size around, oh, .... <ta da> 1.

You mean N? 1 is the RQ/NCPU ratio... (e.g. a 4-CPU box running 5 or 6
such threads, a couple of them blocked on I/O, keeps about 4 runnable:
RQ around NCPU, i.e. RQ/NCPU around 1, not RQ around 1).

> You get a more than one in fact because you can't perfectly predict when
> you'll be blocking. You need more RQ load to keep working despite the
> variation in blocking -- and even that isn't perfect. (So perfect might
> be a way to keep that RQ level at 1 at all times, in other words kernel
> assisted thread management).
>
> But as far as I know, you don't get a lot more than one simply to
> optimise around I/O blocking.
>
> So we can assume the applications the scheduler patch is targetted at
> aren't these applications either. They're not simply "performance tuned
> multi-threaded servers". They're "non-performance tuned multi-threaded
> servers".

So tune the server, not the scheduler.

> > Server design has an impact on switch rate, though. It can be high or low.
> > Anyway, it says *nothing* on the RQ size. Let's assume the server is
> > overloaded, and thus the RQ is long.
>
> You got that from where? The single threaded server will have a maximum
> RQ of 1.

I was talking about the following:
> > > Now now, that isn't true if all the processes are scheduling around the
> > > same memory image. At that point we might even find that it's kernel

it's a multi-process (multi-threaded) server...
... even without changing the whole server framework, you may optimize it.
And this probably ends up shortening the RQ anyway. And if you can't
optimize further, and the RQ is still long, you've probably reached an
HW limit. Or you're doing something for which you need better kernel
support (async I/O?). Either way, why change the scheduler?

> > I think you can shorten the RQ even it this case, optimizing the
> > application... and if you can't, either the hardware is overloaded
> > (and there's little you can do, software-side), or some other kernel
> > parts need to be optimized (assuming the application is not at fault).
> > It's not a scheduler problem anyway. So please don't change it. Give
> > us a better poll() , a clone_httpd_thread(), a do_apache_all(), or
> > whatever. But don't change the scheduler, which is not at fault.
>
> The point is that the scheduler patch is not targetted at the kinds of
> applications you're thinking about.
>
> As far as I can see, for all "performance tuned server applications",
> that really are tweaked for performance at the application level, then
> the scheduler we've got is pretty good. And the only remaining
> optimisation is to do with blocking I/O, not run queue length at all.
>
> The scheduler patch is targetted at what's left _after_ you've
> considered all the performance-tuned applications. I.e. Java business
> objects, which aren't performance-tuned by definition: they're written
> using Java threads so performance is not the top priority.
>
> > Ok, I'll dive into it... B-)
> > But the matter is that operating a system with a large RQ it's usally
> > a mistake. So no need to optimize the scheduler for that... fix the problem
> > instead.
>
> The only two known fixes are (a) don't write that kind of program in
> Java; (b) something possibly complicated involving user space threads.
>
> (a) isn't going to happen because, when you get right down to it,
> performance is not actually the top priority. It's important, but it
> takes second place or lower to other stuff.

So why bother patching the scheduler? Better to keep a clean design.
You're saying: "I don't really care about perf, so I write my application
in Java, not in C. But since I do care about perf, you kernel guys should
change the scheduler to better support my sub-optimal design..."
If performance is not a concern for you, use the standard kernel.
If you're concerned about performance, write a better application, or
a better Java implementation, or use a better language that allows a fast
implementation.

And BTW, I doubt Java *forces* you to write applications with 1000+
threads.

> > So, what's the purpose of the patch? To support broken application designs?
> > If it comes at no costs, I agree. Davide's "no cost" argument is simply
> > based on the wrong assumpion that "short RQ" means "low switch rate".
>
> Correct, that is the purpose of the patch! :-)
>
> The patch is to improve the performance of applications that _aren't_
> written in the best way for performance. Because changing the
> applications is too difficult -- changing the kernel is not.
>
> have a nice day,
> -- Jamie

If you assume that, in general, all well-written applications do not
switch a lot, what's the purpose of an optimized scheduler in the first
place?

Well, let's move it to the new thread. We agree on many things, and
I can't answer all of your questions. See my previous message in your
thread.

.TM.

-- 
      ____/  ____/   /
     /      /       /			Marco Colombo
    ___/  ___  /   /		      Technical Manager
   /          /   /			 ESI s.r.l.
 _____/ _____/  _/		       Colombo@ESI.it
