Re: Linux's implementation of poll() not scalable?

From: Linus Torvalds (torvalds@transmeta.com)
Date: Tue Oct 24 2000 - 10:33:33 EST


[ Moving on to practical matters ]

On Tue, 24 Oct 2000, Dan Kegel wrote:
>
> Might be good to pick more unique names than 'struct event' and 'get_event'.
> People probably use those already.

I agree. I would never implement them under those names, but it's easier
to talk about "event" than about something very specific.

> Hiding ev_list is probably ok. However,
> http://www.citi.umich.edu/techreports/reports/citi-tr-00-7.pdf
> suggests that knowing how many events are pending is a useful measure
> of server load, and that if more than a certain number of events
> are pending, web servers should reject new connections. Thus it might
> be handy to make the return value of get_event be the number of events
> gotten.

Note that my "get_event()" function actually did return exactly that: it
returned the number of events, even though the actual event array handling
was hidden inside the library function.

> > So now you'd start everything off (assuming the same kind of "listen to
> > everything and react to it" server as in my previous example) by just
> > setting
> >
> > bind_event(sock, POLLIN, NULL, accept_fn);
>
> A couple questions:
>
> * how do you unbind? (By calling bind_event with NULL as the accept_fn?)

If done that way, I'd prefer something where an event _mask_ of 0 means
that the event obviously no longer exists, as it can no longer trigger.
The accept_fn would be part of the identifier, and as such zeroing it out
would lose the identification.

But I'd personally probably prefer to be more explicit, and have a
separate function for it.

Note that whether it's a separate function or just making "bind_event()"
really be "modify_or_bind_or_remove_event()" is pretty much not anything I
care about - it gets to be so low-level details that it would be something
that is probably best tried out and something that I don't have any
religious issues with.

I care about high-level interfaces, because _those_ are the ones that
screw me years down the line. For example, select() (and later bind()) was
always an interface that did _not_ fit into the UNIX way of doing things
AT ALL. It really stands out among all the IO interfaces as being very
different. And it's one of the nastier ones to maintain.

> * how do you change a mask? (By calling bind_event with a new mask?)

Same as above. Either bind with a new mask (zero being remove), or
separate call.

> * Is it ok to do these things while in an event_fn? (Yes?)

Oh, absolutely. The kernel doesn't even _know_ when something is in an
event function - the kernel only sees "change the event" and "get the list
of events".

> * Do you get an event whenever an fd is ready for something, or
> only when its readiness changes? (Presumably whenever an fd is ready for something?)

Only when its readiness changes - probably with the addition that it would
simplify things that a new event always (unconditionally) gets added to
the event queue.

Note that there are no races in this implementation - there is no way for
events to get lost, even trivially seen in SMP/threaded environments.
That's important.

The BSD kevent paper goes on about "level and edge triggered" and it
becomes a big thing for them, and they selected level-triggered events as
if it made any difference. And sure - it _does_ make a difference, but the
only difference is in how hard it is to implement, and level-triggered is
a noticeably harder.

The bind_event()/get_events() interface is edge-triggered, and works
fundamentally the same way as SIGCHLD. If you don't keep up, you get
SIGCHLD only once, even if ten children died on you in rapid succession.
And it's edge-triggered: that doesn't mean that you lose events (like the
BSD paper implies), it only means that you have to reap your children with
a loop:

        while ((err = waitpid(..)) >= 0) {
                .. get all children ..
        }

The reason "edge-triggered" (ie only when an event changes) is preferable
is that it's MUCH simpler, and means that the "get_events()" system call
doesn't need to actually understand what the events mean at all. It will
just take the events off the list. If anythig triggers after that, it will
be put back on the list again, but you can guarantee that you don't get
any lost events simply by the simple fact that

 - by the time get_events() returns to user mode (ie before the action
   functions have been called), any new event that happens even before we
   have had to react to the old events will cause the event to be queued
   up again.

What does this mean? It basically means that if you continue to get events
on a file descriptor during your event handler function, the event will
have moved back to the event list, and next time you _will_ get an event
again.

                Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Tue Oct 31 2000 - 21:00:13 EST