Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events

From: Renzo Davoli
Date: Mon May 27 2019 - 09:39:48 EST


On Mon, May 27, 2019 at 09:33:32AM +0200, Greg KH wrote:
> On Sun, May 26, 2019 at 04:25:21PM +0200, Renzo Davoli wrote:
> > This patch implements an extension of eventfd to define file descriptors
> > whose I/O events can be generated at user level. These file descriptors
> > trigger notifications for [p]select/[p]poll/epoll.
> >
> > This feature is useful for user-level implementations of network stacks
> > or virtual device drivers as libraries.
>
> How can this be used to create a "virtual device driver"? Do you have
> any examples of this new interface being used anywhere?

Networking programs use system calls implementing the Berkeley sockets API:
socket, accept, connect, listen, recv*, send* etc. Programs dealing with a
device use system calls like open, read, write, ioctl etc.

When somebody wants to write a library able to behave like a network stack (say
lwipv6, picotcp) or a device, they can implement functions like my_socket,
my_accept, my_open or my_ioctl, as drop-in replacement of their system
call counterpart. (It is also possible to use dynamic library magic to
rename/divert the system call requests to use their 'virtual'
implementation provided by the library: socket maps to my_socket, recv
to my_recv etc).

In this way portability and compatibility is easier, using a well known API
instead of inventing new ones.

Unfortunately this approach cannot be applied to
poll/select/ppoll/pselect/epoll. These system calls can refer at the same time
to file descriptors created by 'real' system calls like socket, open, signalfd...
and to file descriptors returned by my_open, your_socket.

>
> Also, meta-comment, you should provide some sort of test to kselftests
> for your new feature so that it can actually be tested, as well as a man
> page update (separately).
Sure. I'll do it ASAP, let me collect suggestions first.

>
> > Development and porting of code often requires to find the way to wait for I/O
> > events both coming from file descriptors and generated by user-level code (e.g.
> > user-implemented net stacks or drivers). While it is possible to provide a
> > partial support (e.g. using pipes or socketpairs), a clean and complete
> > solution is still missing (as far as I have seen); e.g. I have not seen any
> > clean way to generate EPOLLPRI, EPOLLERR, etc.
>
> What's wrong with pipes or sockets for stuff like this? Why is epoll
> required?
Example:
suppose there is an application waiting for a TCP OOB message. It uses poll to wait
for POLLPRI and then reads the message (e.g. by 'recv').
If I want to port that application to use a network stack implemented as a library
I have to rewrite the code about 'poll' as it is not possible to receive a POLLPRI.
>From a pipe I can just receive a POLLIN, I have to encode in an external data structure
any further information.
Using EFD_VPOLL the solution is straightforward: the function mysocket (used in place
of socket to create a file descripor behaving as a 'real'socket) returns a file
descriptor created by eventfd/EFD_VPOLL, so the poll system call can be left
unmodified in the code. When the OOB message is available the library can trigger
an EPOLLPRI and the message can be received using my_recv.

>
...omissis...
> >
> > Signed-off-by: Renzo Davoli <renzo@xxxxxxxxxxx>
> > ---
> > fs/eventfd.c | 115 +++++++++++++++++++++++++++++++--
> > include/linux/eventfd.h | 7 +-
> > include/uapi/linux/eventpoll.h | 2 +
> > 3 files changed, 116 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/eventfd.c b/fs/eventfd.c
> > index 8aa0ea8c55e8..f83b7d02307e 100644
> > --- a/fs/eventfd.c
> > +++ b/fs/eventfd.c
> > @@ -3,6 +3,7 @@
> > * fs/eventfd.c
> > *
> > * Copyright (C) 2007 Davide Libenzi <davidel@xxxxxxxxxxxxxxx>
> > + * EFD_VPOLL support: 2019 Renzo Davoli <renzo@xxxxxxxxxxx>
>
> No need for this line, that's what the git history shows.
okay

>
> > *
> > */
> >
> > @@ -30,12 +31,24 @@ struct eventfd_ctx {
> > struct kref kref;
> > wait_queue_head_t wqh;
> > /*
> > - * Every time that a write(2) is performed on an eventfd, the
> > - * value of the __u64 being written is added to "count" and a
> > - * wakeup is performed on "wqh". A read(2) will return the "count"
> > - * value to userspace, and will reset "count" to zero. The kernel
> > - * side eventfd_signal() also, adds to the "count" counter and
> > - * issue a wakeup.
> > + * If the EFD_VPOLL flag was NOT set at eventfd creation:
> > + * Every time that a write(2) is performed on an eventfd, the
> > + * value of the __u64 being written is added to "count" and a
> > + * wakeup is performed on "wqh". A read(2) will return the "count"
> > + * value to userspace, and will reset "count" to zero (or decrement
> > + * "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
> > + * side eventfd_signal() also, adds to the "count" counter and
> > + * issue a wakeup.
> > + *
> > + * If the EFD_VPOLL flag was set at eventfd creation:
> > + * count is the set of pending EPOLL events.
> > + * read(2) returns the current value of count.
> > + * The argument of write(2) is an 8-byte integer:
> > + * it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
> > + * EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
> > + * events to be added, deleted to the current set of pending events.
> > + * (i.e. which bits of "count" must be set or reset).
> > + * EFD_VPOLL_MODEVENTS redefines the set of pending events.
>
> Ugh, overloading stuff, this is increased complexity, do you _have_ to
> do it this way?
There can be other approaches: e.g. two specific new system calls like "vpollfd_create" and "vpollfd_ctl".
Their signature could be:
int vpollfd_create(unsigned int init_events, int flags);
where flags are the usual NONBLOCK/CLOEXEC
int vpollfd_ctl(int fd, int op, unsigned int events);
where op can be VPOLL_ADDEVENTS, VPOLL_DELEVENTS, VPOLL_MODEVENTS

It possible to reimplement the patch this way. It needs the definition of the new system calls.
I am proposing just a new tag for eventfd as eventfd purpose is conceptually close to the new feature.
Eventfd creates a file descriptor which generates events. The default eventfd mode uses counters while
EFD_VPOLL uses event flags. The new feature can be implemented on eventfd with a very limited
impact on the kernel core code.
Instead of syscalls, the vpollfd_create/vpollfd_ctl API could be provided by the glibc as (very simple)
library functions, as it is the case for eventfd_read/eventfd_write in /usr/include/sys/eventfd.h

It seems to me that EFD_VPOLL is just a different MODE of the same system call, not a real overloading.
Something similar happened to seccomp: the original implementation by Arcangeli is now
MODE_STRICT, the MODE_FILTER has been added in a second time as an extension.

>
> > */
> > __u64 count;
> > unsigned int flags;
> > @@ -295,6 +308,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
> > return res;
> > }
> >
> > +static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
> > +{
> > + struct eventfd_ctx *ctx = file->private_data;
> > + __poll_t events = 0;
> > + u64 count;
> > +
> > + poll_wait(file, &ctx->wqh, wait);
> > +
> > + count = READ_ONCE(ctx->count);
> > +
> > + events = (count & EPOLLALLMASK);
>
> Why mask?
Because the four most significant bits are not events but policy modifiers (mainly for
multithreading support):
#define EPOLLEXCLUSIVE ((__force __poll_t)(1U << 28))
#define EPOLLWAKEUP ((__force __poll_t)(1U << 29))
#define EPOLLONESHOT ((__force __poll_t)(1U << 30))
#define EPOLLET ((__force __poll_t)(1U << 31))
a file_operations poll function implementation should never generate these, isn't it?
>
> > +
> > + return events;
> > +}
> > +
> > +static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct eventfd_ctx *ctx = file->private_data;
> > + ssize_t res;
> > + __u64 ucnt = 0;
> > +
> > + if (count < sizeof(ucnt))
> > + return -EINVAL;
>
> What is magic about the size of a __u64 here?
Just for consistency with the non EFD_VPOLL eventfd.

>
> > + res = sizeof(ucnt);
> > + ucnt = READ_ONCE(ctx->count);
> > + if (put_user(ucnt, (__u64 __user *)buf))
> > + return -EFAULT;
> > +
> > + return res;
> > +}
> > +
> > +static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct eventfd_ctx *ctx = file->private_data;
> > + ssize_t res;
> > + __u64 ucnt;
> > + __u32 events;
> > +
> > + if (count < sizeof(ucnt))
> > + return -EINVAL;
>
> Why can it not be less than 64?
This is the imeplementation of 'write'. The 64 bits include the 'command'
EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS (in the most
significant 32 bits) and the set of events (in the lowest 32 bits).

>
> > + if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
> > + return -EFAULT;
> > + spin_lock_irq(&ctx->wqh.lock);
> > +
> > + events = ucnt & EPOLLALLMASK;
> > + res = sizeof(ucnt);
> > + switch (ucnt & ~((__u64)EPOLLALLMASK)) {
> > + case EFD_VPOLL_ADDEVENTS:
> > + ctx->count |= events;
> > + break;
> > + case EFD_VPOLL_DELEVENTS:
> > + ctx->count &= ~(events);
> > + break;
> > + case EFD_VPOLL_MODEVENTS:
> > + ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
> > + break;
> > + default:
> > + res = -EINVAL;
> > + }
> > +
> > + /* wake up waiting threads */
> > + if (res >= 0 && waitqueue_active(&ctx->wqh))
> > + wake_up_locked_poll(&ctx->wqh, res);
>
> Can you call this with a spinlock held? I really don't remember, sorry,
> if so, nevermind, but you should check...

I would have done the same objection. eventfd_vpoll_write uses the same pattern
of code of eventfd_write, the difference is in the cause to unblock processes:
counter values vs. events. So this is as correct as eventfd_write is, isn't it? :)

>
> > +
> > + spin_unlock_irq(&ctx->wqh.lock);
> > +
> > + return res;
> > +
> > +}
> > +
> > #ifdef CONFIG_PROC_FS
> > static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
> > {
> > @@ -319,6 +404,17 @@ static const struct file_operations eventfd_fops = {
> > .llseek = noop_llseek,
> > };
> >
> > +static const struct file_operations eventfd_vpoll_fops = {
> > +#ifdef CONFIG_PROC_FS
> > + .show_fdinfo = eventfd_show_fdinfo,
> > +#endif
> > + .release = eventfd_release,
> > + .poll = eventfd_vpoll_poll,
> > + .read = eventfd_vpoll_read,
> > + .write = eventfd_vpoll_write,
> > + .llseek = noop_llseek,
> > +};
> > +
> > /**
> > * eventfd_fget - Acquire a reference of an eventfd file descriptor.
> > * @fd: [in] Eventfd file descriptor.
> > @@ -391,6 +487,7 @@ EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
> > static int do_eventfd(unsigned int count, int flags)
> > {
> > struct eventfd_ctx *ctx;
> > + const struct file_operations *fops = &eventfd_fops;
> > int fd;
> >
> > /* Check the EFD_* constants for consistency. */
> > @@ -410,7 +507,11 @@ static int do_eventfd(unsigned int count, int flags)
> > ctx->flags = flags;
> > ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
> >
> > - fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
> > + if (flags & EFD_VPOLL) {
> > + fops = &eventfd_vpoll_fops;
> > + ctx->count &= EPOLLALLMASK;
> > + }
> > + fd = anon_inode_getfd("[eventfd]", fops, ctx,
> > O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
> > if (fd < 0)
> > eventfd_free_ctx(ctx);
> > diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> > index ffcc7724ca21..63258cf29344 100644
> > --- a/include/linux/eventfd.h
> > +++ b/include/linux/eventfd.h
> > @@ -21,11 +21,16 @@
> > * shared O_* flags.
> > */
> > #define EFD_SEMAPHORE (1 << 0)
> > +#define EFD_VPOLL (1 << 1)
>
> BIT(1)?
It is for the sake of consistency with the rest of the header file,
I can change also the line above:
#define EFD_SEMAPHORE BIT(0)
#define EFD_VPOLL BIT(1)
The value has been chosen to fullfil the request written just above the patched statement:
/*
* CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
* new flags, since they might collide with O_* ones. We want
* to re-use O_* flags that couldn't possibly have a meaning
* from eventfd, in order to leave a free define-space for
* shared O_* flags.
*/
EFD_SEMAPHORE uses the same value of O_WRONLY, I decided to use the same value of O_RDWR.
Both should not have a meaning from eventfd.

>
> > #define EFD_CLOEXEC O_CLOEXEC
> > #define EFD_NONBLOCK O_NONBLOCK
> >
> > #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> > -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
> > +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_VPOLL)
> > +
> > +#define EFD_VPOLL_ADDEVENTS (1UL << 32)
> > +#define EFD_VPOLL_DELEVENTS (2UL << 32)
> > +#define EFD_VPOLL_MODEVENTS (3UL << 32)
>
> Aren't these part of the uapi? Why are they hidden in here?

You are right. There is not a uapi/linux/eventfd.h. EFD_SEMAPHORE is here, so for
consistency I added EFD_VPOLL/EFD_VPOLL_* here, too. (glibc provides the value of EFD_SEMAPHORE
in /usr/include/bits/eventfd.h).

>
> >
> > struct eventfd_ctx;
> > struct file;
> > diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> > index 8a3432d0f0dc..814de6d869c7 100644
> > --- a/include/uapi/linux/eventpoll.h
> > +++ b/include/uapi/linux/eventpoll.h
> > @@ -41,6 +41,8 @@
> > #define EPOLLMSG (__force __poll_t)0x00000400
> > #define EPOLLRDHUP (__force __poll_t)0x00002000
> >
> > +#define EPOLLALLMASK ((__force __poll_t)0x0fffffff)
>
> Why is this part of the uapi?
It can be useful. As I have explained above it permits to split betweeen
event bits and behavioral flags in poll_t.

> thanks,
> greg k-h

many thanks to you
renzo