Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifierinterface

From: Gregory Haskins
Date: Tue Jun 16 2009 - 10:41:24 EST


Michael S. Tsirkin wrote:
> On Mon, Jun 15, 2009 at 10:29:56PM -0400, Gregory Haskins wrote:
>
>> irqfd and its underlying implementation, eventfd, currently utilize
>> the embedded wait-queue in eventfd for signal notification. The nice thing
>> about this design decision is that it re-uses the existing
>> eventfd/wait-queue code and it generally works well....with several
>> limitations.
>>
>> One of the limitations is that notification callbacks are always called
>> inside a spin_lock_irqsave critical section. Another limitation is
>> that it is very difficult to build a system that can recieve release
>> notification without being racy.
>>
>> Therefore, we introduce a new registration interface that is SRCU based
>> instead of wait-queue based, and implement the internal wait-queue
>> infrastructure in terms of this new interface. We then convert irqfd
>> to use this new interface instead of the existing wait-queue code.
>>
>> The end result is that we now have the opportunity to run the interrupt
>> injection code serially to the callback (when the signal is raised from
>> process-context, at least) instead of always deferring the injection to a
>> work-queue.
>>
>> Signed-off-by: Gregory Haskins <ghaskins@xxxxxxxxxx>
>> CC: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
>> CC: Davide Libenzi <davidel@xxxxxxxxxxxxxxx>
>> ---
>>
>> fs/eventfd.c | 115 +++++++++++++++++++++++++++++++++++++++++++----
>> include/linux/eventfd.h | 30 ++++++++++++
>> virt/kvm/eventfd.c | 114 +++++++++++++++++++++--------------------------
>> 3 files changed, 188 insertions(+), 71 deletions(-)
>>
>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>> index 72f5f8d..505d5de 100644
>> --- a/fs/eventfd.c
>> +++ b/fs/eventfd.c
>> @@ -30,8 +30,47 @@ struct eventfd_ctx {
>> */
>> __u64 count;
>> unsigned int flags;
>> + struct srcu_struct srcu;
>> + struct list_head nh;
>> + struct eventfd_notifier notifier;
>> };
>>
>> +static void _eventfd_wqh_notify(struct eventfd_notifier *en)
>> +{
>> + struct eventfd_ctx *ctx = container_of(en,
>> + struct eventfd_ctx,
>> + notifier);
>> +
>> + if (waitqueue_active(&ctx->wqh))
>> + wake_up_poll(&ctx->wqh, POLLIN);
>> +}
>> +
>> +static void _eventfd_notify(struct eventfd_ctx *ctx)
>> +{
>> + struct eventfd_notifier *en;
>> + int idx;
>> +
>> + idx = srcu_read_lock(&ctx->srcu);
>> +
>> + /*
>> + * The goal here is to allow the notification to be preemptible
>> + * as often as possible. We cannot achieve this with the basic
>> + * wqh mechanism because it requires the wqh->lock. Therefore
>> + * we have an internal srcu list mechanism of which the wqh is
>> + * a client.
>> + *
>> + * Not all paths will invoke this function in process context.
>> + * Callers should check for suitable state before assuming they
>> + * can sleep (such as with preemptible()). Paul McKenney assures
>> + * me that srcu_read_lock is compatible with in-atomic, as long as
>> + * the code within the critical section is also compatible.
>> + */
>> + list_for_each_entry_rcu(en, &ctx->nh, list)
>> + en->ops->signal(en);
>> +
>> + srcu_read_unlock(&ctx->srcu, idx);
>> +}
>> +
>> /*
>> * Adds "n" to the eventfd counter "count". Returns "n" in case of
>> * success, or a value lower then "n" in case of coutner overflow.
>>
>
> This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.
>
> Further, to do useful things it might not be enough that you can sleep:
> with iofd you also want to access current task with e.g. copy from user.
>

Something else to consider: For iosignalfd, we can assume we will
always be called from vcpu process context so we might not really need
official affirmation from the system. For irqfd, we cannot predict who
may be injecting the interrupt (for instance, it might be a
PCI-passthrough hard-irq context). I am not sure if this knowledge
actually helps to simplify the problem space, but I thought I should
mention it.

-Greg


Attachment: signature.asc
Description: OpenPGP digital signature