Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

From: Jens Axboe
Date: Fri May 31 2019 - 12:58:15 EST


On 5/31/19 10:02 AM, Roman Penyaev wrote:
> On 2019-05-31 16:48, Jens Axboe wrote:
>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>> Hi all,
>>>
>>> This is v3 which introduces pollable epoll from userspace.
>>>
>>> v3:
>>> - Measurements made, represented below.
>>>
>>> - Fix alignment for epoll_uitem structure on all 64-bit archs except
>>> x86-64. epoll_uitem should always be 16 bytes; a proper BUILD_BUG_ON
>>> is added. (Linus)
>>>
>>> - Check pollflags explicitly for 0 inside the work callback, and do
>>> nothing if 0.
>>>
>>> v2:
>>> - No reallocations, the max number of items (thus size of the user
>>> ring) is specified by the caller.
>>>
>>> - Interface is simplified: -ENOSPC is returned on an attempt to add a
>>> new epoll item once the number of items has reached the max, nothing
>>> more.
>>>
>>> - Allocated pages are accounted using user->locked_vm and limited to
>>> the RLIMIT_MEMLOCK value.
>>>
>>> - EPOLLONESHOT is handled.
>>>
>>> This series introduces pollable epoll from userspace, i.e. the user
>>> creates an epfd with a new EPOLL_USERPOLL flag, mmaps the epoll
>>> descriptor, gets header and ring pointers and then consumes ready
>>> events from a ring, avoiding the epoll_wait() call. When the ring is
>>> empty, the user has to call epoll_wait() in order to wait for new
>>> events. epoll_wait() returns -ESTALE if the user ring still has events
>>> in it (a kind of indication that the user has to consume events from
>>> the user ring first; I could not invent anything better than returning
>>> -ESTALE).
>>>
>>> For user header and user ring allocation I used vmalloc_user(). I
>>> found that it is much easier to reuse remap_vmalloc_range_partial()
>>> than to deal with the page cache (like aio.c does). What is also nice
>>> is that the virtual address is properly aligned on SHMLBA, thus there
>>> should not be any d-cache aliasing problems on archs with VIVT or VIPT
>>> caches.
>>
>> Why aren't we just adding support to io_uring for this instead? Then we
>> don't need yet another entirely new ring that is just a little
>> different from what we have.
>>
>> I haven't looked into the details of your implementation, just curious
>> if there's anything that makes using io_uring a non-starter for this
>> purpose?
>
> Afaict the main difference is that you do not need to recharge an fd
> (submit a new poll request, in terms of io_uring): once an fd has been
> added to epoll with epoll_ctl(), we get events. When you have thousands
> of fds, that should matter.
>
> Another interesting question is how difficult it is to modify existing
> event loops in event libraries in order to support recharging
> (EPOLLONESHOT in terms of epoll).
>
> Maybe Azat, who maintains libevent, can shed light on this (currently I
> see that libevent does not support "EPOLLONESHOT" logic).

In terms of existing io_uring poll support, which is what I'm guessing
you're referring to, it is indeed just one-shot. But there's no reason
why we can't have it persist until explicitly canceled with POLL_REMOVE.
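
To make the one-shot versus persistent distinction concrete, here is a
minimal sketch in plain epoll terms (not io_uring). It assumes a pipe that
is written once and never drained, so it stays readable; the helper name
count_wakeups() and its parameters are made up for illustration only.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: check a permanently readable pipe `rounds` times
 * and count how often epoll reports it.
 *   oneshot: register the fd with EPOLLONESHOT
 *   rearm:   "recharge" the fd with EPOLL_CTL_MOD after each event
 * Returns the number of reported events, or -1 if setup fails.
 */
int count_wakeups(int rounds, int oneshot, int rearm)
{
	struct epoll_event ev = { 0 }, got;
	int pfd[2], epfd, hits = 0;

	assert(rounds > 0);
	if (pipe(pfd) < 0)
		return -1;
	epfd = epoll_create1(0);
	if (epfd < 0) {
		hits = -1;
		goto out_pipe;
	}

	ev.events = EPOLLIN | (oneshot ? EPOLLONESHOT : 0);
	ev.data.fd = pfd[0];
	/* One byte is written and never read back: the fd stays readable. */
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0 ||
	    write(pfd[1], "x", 1) != 1) {
		hits = -1;
		goto out_ep;
	}

	for (int i = 0; i < rounds; i++) {
		/* Timeout 0: only check readiness, never block. */
		if (epoll_wait(epfd, &got, 1, 0) != 1)
			continue;
		hits++;
		/* A one-shot entry is disabled after firing; re-arm it. */
		if (oneshot && rearm)
			epoll_ctl(epfd, EPOLL_CTL_MOD, pfd[0], &ev);
	}

out_ep:
	close(epfd);
out_pipe:
	close(pfd[0]);
	close(pfd[1]);
	return hits;
}
```

With three rounds, a persistent registration should be reported every
round, a one-shot registration without re-arming only once, and a one-shot
registration with re-arming every round again, at the cost of one extra
epoll_ctl() call per event.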

--
Jens Axboe