Re: [PATCH] sysctl: add support for poll()

From: Eric W. Biederman
Date: Wed Aug 03 2011 - 14:08:24 EST


Lucas De Marchi <lucas.demarchi@xxxxxxxxxxxxxx> writes:

> On Wed, Aug 3, 2011 at 6:40 AM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>>
>> Lucas De Marchi <lucas.demarchi@xxxxxxxxxxxxxx> writes:
>>
>> > @@ -176,6 +177,43 @@ static ssize_t proc_sys_write(struct file *filp, const char __user *buf,
>> >  	return proc_sys_call_handler(filp, (void __user *)buf, count, ppos, 1);
>> >  }
>> >
>> > +static int proc_sys_open(struct inode *inode, struct file *filp)
>> > +{
>> > +	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
>> > +
>> > +	if (table->poll) {
>> > +		unsigned long event = atomic_read(&table->poll->event);
>> > +
>> > +		filp->private_data = (void *)event;
>>
>> It would be nice if there was something that abstracted
>> filp->private_data accesses so they were type safe, and
>> clear what you were doing.
>
> private_data is used everywhere to save this kind of private per-inode
> data. Here it's just the current number of events. The number itself
> doesn't matter, as long as when we wake up in poll() we can compare
> these numbers.

It sure matters for code readability if you have to cast the value
every time. A pair of light wrappers around filp->private_data
that abstract away the cast would be useful.

>> > +static unsigned int proc_sys_poll(struct file *filp, poll_table *wait)
>> > +{
>> > +	struct inode *inode = filp->f_path.dentry->d_inode;
>> > +	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
>> > +	unsigned long event = (unsigned long)filp->private_data;
>> > +	unsigned int ret = POLLIN | POLLRDNORM;
>>
>> The default return value here must be DEFAULT_POLLMASK otherwise you
>> change the result of poll for every single proc file and in general
>
> This is not true. This patch has nothing to do with proc files and
> will not change their return values. However, I did use the same return
> values as are used for proc files that support poll (see
> fs/proc/base.c:mounts_poll()).

Apologies, I meant /proc/sys/ files, and it is absolutely true that
you are changing the poll behavior of every sysctl with this change.

/proc/mounts doesn't return writable because it doesn't support
writes. Most sysctl files support writes and thus should return
writable.

>> return the wrong values because most sysctls are writable.
>
> For sysctl, there wasn't any poll support so how could I break
> something that didn't exist?

Not true. There is the generic support of poll that happens
when poll is not implemented. A file that does not specifically
implement poll reports itself as both readable and writable.

Which ultimately is why we have to use POLLERR | POLLPRI
and not the more obvious values of readable or writable.

>> > +	if (!table->proc_handler)
>> > +		goto out;
>> > +
>> > +	if (!table->poll)
>> > +		goto out;
>> > +
>> > +	poll_wait(filp, &table->poll->wait, wait);
>> > +
>> > +	if (event != atomic_read(&table->poll->event)) {
>> > +		filp->private_data = (void *)(unsigned long)atomic_read(
>> > +						&table->poll->event);
>> > +		ret = POLLIN | POLLRDNORM | POLLERR | POLLPRI;
>>
>> Why are you updating the event here instead of when the file is read?
>> Presumably we want to wait until the file is read and people see the new value.
>
> This is to indicate, when we are polling the file, that the file
> should be read again. When the process wakes up we compare the filp's
> copy of the event number with the one in the table. Since they were the
> same when the file was opened, if they are different now it indicates
> that the file was updated and should be read again.
>
> There's no way to do this in read: we update table->poll->event
> whenever the table entry is changed, and when waking up from
> poll_wait() we compare whether these values are the same.

What do you mean there is no way to do this in read?
That is how it is implemented in sysfs, and it certainly feels more
intuitive. There might be a semantic reason for not doing it in
read that I am missing, but it is certainly technically possible.

>> > @@ -1011,6 +1012,12 @@ extern int proc_do_large_bitmap(struct ctl_table *, int,
>> >   * cover common cases.
>> >   */
>> >
>> > +/* Support for userspace poll() to watch for changes */
>> > +struct ctl_table_poll {
>> > +	atomic_t event;
>> > +	wait_queue_head_t wait;
>> > +};
>>
>> It is a moderately big stretch to think that everything that wants to
>> poll a value under /proc/sys will want to use struct ctl_table_poll.
>
> This is the way we know which processes are polling() this entry (hence
> the wait queue). It's a well-known idiom and again maps to the
> existing pollable files in /proc.

No. This is similar to, but not identical to, what is done for every
file in proc. I agree we need something to flag that we are pollable.

>> > @@ -1291,6 +1292,7 @@ SYSCALL_DEFINE2(setdomainname, char __user *, name, int, len)
>> >  		memset(u->domainname + len, 0, sizeof(u->domainname) - len);
>> >  		errno = 0;
>> >  	}
>> > +	uts_proc_notify(UTS_PROC_DOMAINNAME);
>> >  	up_write(&uts_sem);
>> >  	return errno;
>> >  }
>> > diff --git a/kernel/utsname_sysctl.c b/kernel/utsname_sysctl.c
>> > index a2cd77e..e96b766 100644
>> > --- a/kernel/utsname_sysctl.c
>> > +++ b/kernel/utsname_sysctl.c
>> > @@ -13,6 +13,7 @@
>> >  #include <linux/uts.h>
>> >  #include <linux/utsname.h>
>> >  #include <linux/sysctl.h>
>> > +#include <linux/wait.h>
>> >
>> >  static void *get_uts(ctl_table *table, int write)
>> >  {
>> > @@ -51,12 +52,28 @@ static int proc_do_uts_string(ctl_table *table, int write,
>> >  	uts_table.data = get_uts(table, write);
>> >  	r = proc_dostring(&uts_table, write, buffer, lenp, ppos);
>> >  	put_uts(table, write, uts_table.data);
>> > +
>> > +	if (write) {
>> > +		atomic_inc(&table->poll->event);
>> > +		wake_up_interruptible(&table->poll->wait);
>>
>> Duplicating code again? Why are you not calling your generic notify?
>
> Duplicating? The uts_proc_notify() was only created so that when
> userspace calls the hostname/domainname syscalls we wake up whoever is
> polling the corresponding sysctl entries.

> Here it's for the generic case in which we write to /proc/sysctl.

Except the routine you have changed isn't for the generic case; it
is for a specific set of writes to domainname or hostname.

I agree we need code on the update-the-value-through-sysctl path.
However, if you are going to go with a generic structure, why not
implement a generic function to update that structure? Especially since
you already have two copies of this notify logic.

>> > +static struct ctl_table_poll hostname_poll = {
>> > +	.event		= ATOMIC_INIT(0),
>> > +	.wait		= __WAIT_QUEUE_HEAD_INITIALIZER(hostname_poll.wait),
>> > +};
>> > +
>> > +static struct ctl_table_poll domainname_poll = {
>> > +	.event		= ATOMIC_INIT(0),
>> > +	.wait		= __WAIT_QUEUE_HEAD_INITIALIZER(domainname_poll.wait),
>> > +};
>>
>> No generic initializer for this generic structure? Instead every user
>> must repeat the initialization?
>
> The wait queues are different, and the ".poll" is set to this. I can't
> see how we could generalize this. Maybe by creating a macro
> CTL_TABLE_POLL_DECLARE() or something along those lines, but I'm not
> sure it's worth doing.

If we are going to have a generic structure it is worth doing,
to reduce maintenance when someone changes things around and for
code clarity.

That initializer is just a hair better than line noise.

You can't seem to make up your mind here if you are implementing
a generic facility that should work for all sysctl entries or
if you are implementing a one off for hostname and nisdomainname.

For a one-off, what you have is ugly, but who cares (except that you
touch the generic code). For a generic facility you are missing common
helpers.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/