Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

From: Jason Baron
Date: Mon Mar 02 2015 - 00:05:10 EST


On 02/27/2015 04:31 PM, Jonathan Corbet wrote:
> On Fri, 27 Feb 2015 13:10:34 -0800
> Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
>> I don't really understand the need for rotation/round-robin. We can
>> solve the thundering herd via exclusive wakeups, but what is the point
>> in choosing to wake the task which has been sleeping for the longest
>> time? Why is that better than waking the task which has been sleeping
>> for the *least* time? That's probably faster as that task's data is
>> more likely to still be in cache.
> So here's my chance to show the world what a fool I am (again)... If I
> understand this at all, a task woken from epoll_wait() remains on the wait
> queues while it is off doing other stuff. If you're doing exclusive
> wakeups, the task at the head of the queue will get all of them, since it
> never gets removed from the queue. So you don't spread your load around,
> and, indeed, you may "wake" a process that is busy doing something else and
> can't deal with the event now anyway. You need some way to shuffle up the
> wait queue, and round-robin is probably as good as any.
>
> (The alternative would be to take the process off the queue until it calls
> epoll_wait() again, but that runs counter to what epoll is all about).
>
> At least, that was my impression when I took a look at this stuff.
>
> jon

So tasks do not remain on wait queues when they are not in
epoll_wait(). That is, tasks are added to the associated epoll
fd wait queue at the beginning of epoll_wait(), and removed
from the associated epoll fd wait queue when epoll_wait()
returns.

One can think about the problem, perhaps, as assigning fd
events - POLLIN, POLLLOUT, etc., to a set of tasks. And this
discussion is about how to do the assignment in certain
cases. Namely, one could start by partitioning the set of fds
into unique sets and then assigning them
(via EPOLL_CTL_ADD) to different epoll fds. Then, if there
is say a single task blocking on each epoll fd (via epoll_wait())
then each task can work nicely on its own set of events
without needing to necessarily co-ordinate with the other
tasks.

Now, this all works fine until, we have an 'fd' or event source
that we wish to share among all the tasks. We might
want to share it because it generates events or work, that
would potentially overwhelm a single task.

So in this shared event source case, where we have added
the fd to all of the epoll fds, we currently do a wake all.
This series attempts to change that behavior (with an
optional flag to epoll_create1()), into a round robin wakeup
(both to avoid excessive wakeups and to more evenly distribute
the wakeups). Note also that it will continue to wake up tasks,
as long as it doesn't find any in epoll_wait(). Thus, it still can
potentially wake up all if nobody is in epoll_wait().

Now, we could try and distribute the fd events among tasks
all waiting on a single epoll fd (meaning we have a single
event queue). But, we have already partitioned most of the
events, why combine them back into a single queue?

Thanks,

-Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/