Re: How to increat [sic.] max open files?

Richard Gooch (rgooch@atnf.CSIRO.AU)
Sun, 5 Jan 1997 10:22:08 +1100

Richard B. Johnson writes:
> On Sat, 4 Jan 1997, Richard Gooch wrote:
> > Richard B. Johnson writes:
> > > On Fri, 3 Jan 1997, Baldur Norddahl wrote:
> > > [SNIPPED]
> > > > On Fri, 3 Jan 1997, Richard B. Johnson wrote:
> > > [SNIPPED]
> > > >
> >
> > Well, that's one approach: do it all in the kernel (let's hear it
> > for kernel bloat). Another school of thought is that things which
> > don't *have* to be done in the kernel are done in userspace. This is
> > particularly true if the kernel is not multi-threaded. If you push the
> > work for managing large numbers of connections into the kernel, then
> > that may give an application an "unfair" advantage over other
> > applications which are CPU bound. By pushing the work into userland,
> > it gives the kernel the opportunity to context switch the process out
> > and give someone else a go.
> Very true! That's one of the reasons why I don't think a server should be
> a "workstation" at the same time.

In an ideal world. However, budgetary concerns often lead to the
general-purpose login/compute server also running httpd. It doesn't
work to say "you really shouldn't do this: it's not the Right Way(tm)
of doing things, so you'll just have to suffer". What people want to
hear is that you've improved performance for their application mix.

> > > Of course we will always have user-mode programmers who think that they
> > > can make better code than the kernel code, but you should know how that
> > > works.
> >
> > That assumes the correct approach requires the absolute tightest
> > code, preferably written in assembler, and so forth. See below.
> >
> > > When user code has to keep track of many "sockets" it usually has to look
> > > through a list (perhaps linked) of things to be done once some event
> > > occurs (such as an inquiry from a socket-connected client). It can't
> > > just use socket values as indexes because clients disconnect and new
> > > out-of-order sockets are assigned for new connections.
> >
> > Most *traditional* approaches to socket management employ a linked
> > list where the sockets are listed in order. In that case, a sequential
> > search is required, which can be quite painful for large lists. Now,
> > quoting your belief that better algorithm design is the correct
> > approach, here are a few improvements:
> >
> > 1) construct an array of (void *) entries. The FD is an index into
> > this array. Very fast lookup. A NULL value indicates no entry,
> > otherwise it's a pointer to your socket management structure. It's
> > irrelevant whether or not sockets close: just deallocate the structure
> > and set the pointer to NULL. When the OS gives you a FD which is
> > bigger than your array length, reallocate the array to twice its
> > size. Simple.
> >
> > 2) Instead of a simple linked list, use a _binary tree_! Wow. New
> > concept. When you return from select(2), walk down the tree and pick
> > up your socket management structure. A mere 20 iterations for a
> > million FDs. Lots of lost cycles there.
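A minimal C sketch of the array lookup in (1) above. The `struct conn` type, the initial size of 64, and the exact growth policy are illustrative assumptions, not part of the original suggestion:

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for whatever per-connection state the server keeps. */
struct conn { int fd; /* ... per-connection state ... */ };

static struct conn **table = NULL;
static size_t table_len = 0;

/* Grow the table (doubling) until index fd fits; new slots become NULL. */
static int table_reserve(int fd)
{
    size_t new_len, old_len = table_len;
    struct conn **tmp;

    if ((size_t)fd < table_len)
        return 0;
    new_len = table_len ? table_len : 64;
    while (new_len <= (size_t)fd)
        new_len *= 2;
    tmp = realloc(table, new_len * sizeof(*tmp));
    if (tmp == NULL)
        return -1;
    memset(tmp + old_len, 0, (new_len - old_len) * sizeof(*tmp));
    table = tmp;
    table_len = new_len;
    return 0;
}

/* O(1) lookup: NULL means no connection is using that FD. */
static struct conn *table_get(int fd)
{
    return ((size_t)fd < table_len) ? table[fd] : NULL;
}
```

On accept(2), call table_reserve(fd) and store your structure in table[fd]; when the connection closes, free the structure and set table[fd] back to NULL.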
> >
> > > Once the list becomes large, much time is wasted just getting to the
> > > code that is going to service the request. There might even be a context-
> > > switch or two before your application actually does anything useful as
> > > far as the client is concerned.
> >
> > I don't see why you're so convinced that a process managing thousands
> > of FDs is inefficient. You talk about better algorithm design, and yet
> > you don't seem to consider a few simple, efficient approaches to
> > solving the problem in userland. Anyone who has large lists to
> > maintain has had to solve this problem.
> >
> Note that we are both saying the same thing. I think we are just disagreeing
> about how to say it.

Not quite: your approach is to push it into the kernel, whereas mine
is to push it into userland. You used an argument about slow searching
of large lists in a single process as a justification for saying that
the jobs should be split into many processes, so this searching would
take less time. In fact, that approach (apart from pushing work into
the kernel) is not necessarily any more efficient. Consider the case
where there is a lot of FD activity, so much so that each of your
processes is woken up from select. In that case each process has to
traverse its linked list, so you end up with the same total amount of
list traversal; it's just that hundreds of processes have to be
context switched to do it. This will make the machine much less
responsive for other users.
Furthermore, your approach of splitting the work into many processes
(say 20 000 connections, 128 connections per process, 157 processes)
is actually *less efficient* once the binary tree is adopted. A
single process can walk the tree in ~14 steps. Each of your
processes will take 7 steps, but multiply that by 157 and you get 1099
steps: more total user-mode CPU time. You then still have to add the
costs of kernel context switching.

I contend that the correct approach is a single userland process.

> > > Now, suppose your code used a different "port" (after some initial
> > > negotiation), for each Client. Then suppose your code wasn't even
> > > executed until the kernel gave you control with the port (read: index)
> > > already found.
> > >
> > > Don't you think that this would be a more efficient way to handle the
> > > stereotypical Client/Server methodology?
> >
> > Nope. Your example is wasteful of system resources. Particularly
> > RAM. It costs kernel memory (RAM) for each process/task. Say it costs
> > 512 bytes for an entry in the process table. Let's say the process
> > limit on FDs is 256. Say each connection requires two FDs (the
> > connection and perhaps an open disc file). That limits a process to 128
> > connections. To support 20 000 connections, you need 157
> > processes. This takes nearly 80 kBytes of RAM; memory which can't be
> > swapped. This number is nearly doubled on 64 bit machines. On top of
> > this you need to add the RAM it takes for each FD.
> > Also it will take the kernel time to walk through the runnable process
> > list. Then there's added overheads dealing with the page table.
> >
> The initial assumption was (is) that the kernel is more efficient at this
> than user-mode code.

It may be true that in general, kernel code is more efficient than
userland code. However, if you push work into the kernel, and as a
result of that pushing you end up with more *total work* being done,
it is almost certain that your approach is less efficient.
I previously didn't even touch upon issues of cache invalidation and
so forth, which will punish your scheme even further.

> > > Now, this is just one example. It is not a good example but it is one
> > > that is easy to understand. Another example is the simple telnet daemon.
> > > ... no longer efficient because of the wasted overhead. Note that the telnet
> > > example could be accessing a database or serving files instead of being
> > > a terminal server to a shell.
> >
> > Did you know that Solaris 2 has a kernel-level "telnetd" for the
> > express reason of reducing the number of processes on the system?
> > Because hundreds of "telnetd" processes load the system. Each of those
> > "telnetd" processes does comparatively little work (compared to the
> > shell the user is running). A single process/task can do the work of
> > hundreds, leaving the kernel free to schedule the more important jobs:
> > users' shells.
> Linux now has "nfsiod"..... sorta like, but for NFS..(separate tasks!!)
> The fact that Solaris does something just might mean that it's wrong. I
> have dealt with 6 years of SunBugs, every release making the machines slower
> and slower and slower and ....
> > They've learnt that too many processes is a bad thing.
> They probably have learned nothing.

I've seen enough of the limitations of Slowass 2 myself. However,
with this "shared" kernel telnetd, they have had the glimmerings of a
good idea. As Alan Cox pointed out, another approach is to have a
userland shared telnetd. The important point is that there is only
*one* telnetd running, only one process table entry.

If you really want the kernel to help in a situation where you have
20 000 connections, how about implementing the following:

a system call similar to select(2) or poll(2), except that instead of
a single list of FDs serving both as input to the kernel and as the
return from the kernel, the input and return lists are separate. The
return list is written in a compact form, such that the returned FDs
occupy the first N entries (where N is the number of returned FDs).

This would have the benefit that the application only has to walk
through a small list to find FDs which are ready for I/O. Right now,
the application has no choice but to walk through the entire list of
descriptors to find which are ready.
This scheme assumes that only a small fraction of all FDs are ready
for I/O, which is a fair assumption.