Re: How to increat [sic.] max open files?

Richard Gooch (rgooch@atnf.CSIRO.AU)
Sat, 4 Jan 1997 12:23:04 +1100

Richard B. Johnson writes:
> On Fri, 3 Jan 1997, Baldur Norddahl wrote:
> > On Fri, 3 Jan 1997, Richard B. Johnson wrote:
> >
> > > Now, it is not efficient to kick-off a separate child to handle each
> > > connection. It is also not efficient to have a single task handle everything.
> > > There is some design necessary to figure out what goes in between.
> >
> > In the mud case it is actually efficient to keep everything in a single
> > process. At least that is what the profilers say. It is the simulation of
> > the virtual world that hogs the CPU, not the client handling. Having to
> > lock the objects in the virtual world would just complicate the simulation
> > and thereby make it slower.
> >
> > What theoretical background do you have for concluding that using a single
> > process to handle the clients is NEVER efficient?
> >
> The idea is that it is most efficient to have the kernel do as much work
> as possible. The presumption is that the kernel code in any operating system
> will be tuned to perform its function with the minimum possible overhead.
> Indeed, in many architectures, there are even opcodes that are not allowed
> in user-space programs, that are used by the operating system. In fact,
> the operating system can often set some bit in a page-register, rather
> than having to copy data, etc.

Well, that's one approach: do it all in the kernel (let's hear it
for kernel bloat). Another school of thought is that things which
don't *have* to be done in the kernel are done in userspace. This is
particularly true if the kernel is not multi-threaded. If you push the
work for managing large numbers of connections into the kernel, then
that may give an application an "unfair" advantage over other
applications which are CPU bound. By pushing the work into userland,
you give the kernel the opportunity to context switch the process out
and give someone else a go.

> Of course we will always have user-mode programmers who think that they
> can make better code than the kernel code, but you should know how that
> works.

That assumes the correct approach requires the absolute tightest
code, preferably written in assembler, and so forth. See below.

> When user code has to keep track of many "sockets" it usually has to look
> through a list (perhaps linked) of things to be done once some event
> occurs (such as an inquiry from a socket-connected client). It can't just use
> socket values as indexes because clients disconnect and new out-of-order
> sockets are assigned for new connections.

Most *traditional* approaches to socket management employ a linked
list where the sockets are listed in order. In that case, a sequential
search is required, which can be quite painful for large lists. Now,
quoting your belief that better algorithm design is the correct
approach, here are a few improvements:

1) construct an array of (void *) entries. The FD is an index into
this array. Very fast lookup. A NULL value indicates no entry,
otherwise it's a pointer to your socket management structure. It's
irrelevant whether or not sockets close: just deallocate the structure
and set the pointer to NULL. When the OS gives you an FD which is
bigger than your array length, reallocate the array to twice its
size. Simple.

2) Instead of a simple linked list, use a _binary tree_! Wow. New
concept. When you return from select(2), walk down the tree and pick
up your socket management structure. A mere 20 iterations for a
million FDs. Lots of lost cycles there.

> Once the list becomes large, much time is wasted just getting to the
> code that is going to service the request. There might even be a context-
> switch or two before your application actually does anything useful as
> far as the client is concerned.

I don't see why you're so convinced that a process managing thousands
of FDs is inefficient. You talk about better algorithm design, and yet
you don't seem to consider a few simple, efficient approaches to
solving the problem in userland. Anyone who has large lists to
maintain has had to solve this problem.

> Now, suppose your code used a different "port" (after some initial
> negotiation), for each Client. Then suppose your code wasn't even
> executed until the kernel gave you control with the port (read index),
> already found.
> Don't you think that this would be a more efficient way to handle the
> stereotypical Client/Server methodology?

Nope. Your example is wasteful of system resources. Particularly
RAM. It costs kernel memory (RAM) for each process/task. Say it costs
512 bytes for an entry in the process table. Let's say the process
limit on FDs is 256. Say each connection requires two FDs (the
connection and perhaps an open disc file). That limits a process to 128
connections. To support 20 000 connections, you need 157
processes. This takes nearly 80 kBytes of RAM; memory which can't be
swapped. This number is nearly doubled on 64 bit machines. On top of
this you need to add the RAM it takes for each FD.
Also it will take the kernel time to walk through the runnable process
list. Then there are the added overheads of dealing with the page table.

> Now, this is just one example. It is not a good example but it is one
> that is easy to understand. Another example is the simple telnet daemon.
> It issues an "listen", and when a "connect" occurs, it sets up I/O
> descriptors, spawns a child, then goes back to sleep. The child is
> entirely independent "setsid()", uses its own resources, etc. This
> child only worries about one client and has, in fact, become the client.
> Until the number of "connections" reaches some point, this seems very
> efficient. However, memory is being wasted because each of the children
> (now users), have a lot of code that is only used once. At some point,
> having separate children perform tasks on behalf of separate Clients is
> no longer efficient because of the wasted overhead. Note that the telnet
> example could be accessing a database or serving files instead of being
> a terminal server to a shell.

Did you know that Solaris 2 has a kernel-level "telnetd" for the
express reason of reducing the number of processes on the system?
Hundreds of "telnetd" processes load the system, yet each of those
processes does comparatively little work (compared to the shell the
user is running). A single process/task can do the work of hundreds,
leaving the kernel free to schedule the more important jobs: the
users' shells.
They've learnt that too many processes is a bad thing.