Re: [KVM PATCH v3 3/3] KVM: Fix races in irqfd using new eventfd_kref_getinterface

From: Gregory Haskins
Date: Mon Jun 22 2009 - 14:05:36 EST


Michael S. Tsirkin wrote:
> On Mon, Jun 22, 2009 at 01:31:29PM -0400, Gregory Haskins wrote:
>
>> Michael S. Tsirkin wrote:
>>
>>> On Mon, Jun 22, 2009 at 12:05:57PM -0400, Gregory Haskins wrote:
>>>
>>>
>>>> This patch fixes all known races in irqfd, and paves the way to restore
>>>> DEASSIGN support. For details of the eventfd races, please see the patch
>>>> presumably commited immediately prior to this one.
>>>>
>>>> In a nutshell, we use eventfd_kref_get/put() to properly manage the
>>>> lifetime of the underlying eventfd. We also use careful coordination
>>>> with our workqueue to ensure that all irqfd objects have terminated
>>>> before we allow kvm to shutdown. The logic used for shutdown walks
>>>> all open irqfds and releases them. This logic can be generalized in
>>>> the future to allow a subset of irqfds to be released, thus allowing
>>>> DEASSIGN support.
>>>>
>>>> Signed-off-by: Gregory Haskins <ghaskins@xxxxxxxxxx>
>>>>
>>>>
>>> I think this patch is a shade too tricky. Some explanation why below.
>>>
>>> But I think irqfd_pop is a good idea.
>>>
>>>
>> Yeah, next we can add something like "irqfd_remove(gsi)" in a similar
>> way to do DEASSIGN.
>>
>>
>>> Here's an alternative design sketch: add a list of irqfds to be shutdown
>>> in kvm, and create a single-threaded workqueue. To kill an irqfd, move
>>> it from list of live irqfds to list of dead irqfds, then schedule work
>>> on a workqueue that walks this list and kills irqfds.
>>>
>>>
>> Yeah, I actually thought of that too, and I think that will work. But
>> then I realized flush_schedule_work does the same thing and its much
>> less code. Perhaps it is also much less clear, too ;) At the very
>> least, you have made me realize I need to comment better.
>>
>
> Not really, it's impossible to document all races one have thought
> about and avoided.
>

Heh, that is a very astute observation.

>
>>>
>>>
>>>> ---
>>>>
>>>> virt/kvm/eventfd.c | 144 ++++++++++++++++++++++++++++++++++++++++------------
>>>> 1 files changed, 110 insertions(+), 34 deletions(-)
>>>>
>>>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>>>> index 9656027..67985cd 100644
>>>> --- a/virt/kvm/eventfd.c
>>>> +++ b/virt/kvm/eventfd.c
>>>> @@ -28,6 +28,7 @@
>>>> #include <linux/file.h>
>>>> #include <linux/list.h>
>>>> #include <linux/eventfd.h>
>>>> +#include <linux/kref.h>
>>>>
>>>> /*
>>>> * --------------------------------------------------------------------
>>>> @@ -36,26 +37,68 @@
>>>> * Credit goes to Avi Kivity for the original idea.
>>>> * --------------------------------------------------------------------
>>>> */
>>>> +
>>>> +enum {
>>>> + irqfd_flags_shutdown,
>>>> +};
>>>> +
>>>> struct _irqfd {
>>>> struct kvm *kvm;
>>>> + struct kref *eventfd;
>>>>
>>>>
>>> Yay, kref.
>>>
>>>
>>>
>>>> int gsi;
>>>> struct list_head list;
>>>> poll_table pt;
>>>> wait_queue_head_t *wqh;
>>>> wait_queue_t wait;
>>>> - struct work_struct inject;
>>>> + struct work_struct work;
>>>> + unsigned long flags;
>>>>
>>>>
>>> Just make it "int shutdown"?
>>>
>>>
>> Yep, that is probably fine but we will have to use an explicit wmb in
>> lieu of a set_bit operation. NBD.
>>
>>
>>>
>>>
>>>> };
>>>>
>>>> static void
>>>> -irqfd_inject(struct work_struct *work)
>>>> +irqfd_release(struct _irqfd *irqfd)
>>>> +{
>>>> + eventfd_kref_put(irqfd->eventfd);
>>>> + kfree(irqfd);
>>>> +}
>>>> +
>>>> +static void
>>>> +irqfd_work(struct work_struct *work)
>>>> {
>>>> - struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
>>>> + struct _irqfd *irqfd = container_of(work, struct _irqfd, work);
>>>> struct kvm *kvm = irqfd->kvm;
>>>>
>>>> - mutex_lock(&kvm->irq_lock);
>>>> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>>>> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>>>> - mutex_unlock(&kvm->irq_lock);
>>>> + if (!test_bit(irqfd_flags_shutdown, &irqfd->flags)) {
>>>>
>>>>
>>> Why is it safe to test this bit outside of any lock?
>>>
>>>
>> Because the ordering is guaranteed to set_bit(), schedule_work(). All
>> we need to do is make sure that the work-queue runs at least one more
>> time after the flag has been set. (Of course, I could have screwed up
>> too, but that was my rationale).
>>
>>
>>>
>>>
>>>> + /* Inject an interrupt */
>>>> + mutex_lock(&kvm->irq_lock);
>>>> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>>>> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>>>> + mutex_unlock(&kvm->irq_lock);
>>>> + } else {
>>>>
>>>>
>>> Not much shared code here - create a separate showdown work struct?
>>> They are cheap ...
>>>
>>>
>> We can't because we need to ensure that all inject-jobs complete before
>> release-jobs. Reading the work-queue code, it would be a deadlock for
>> the release-job to do a flush_work(inject-job). Therefore, both
>> workloads are encapsulated into a single job, and we ensure that the job
>> is launched at least one more time after the flag has been set.
>>
>
> AFAIK schedule_work does not give you in-order guarantees - it's
> multithreaded. you will have to create a single-threaded workqueue
> if you want in order execution.
>

Right, that was my understanding as well. Thats why I do both tasks
from a single work-item ;)

>
>> Of course, now that I wrote that, I realize it was clear-as-mud in the
>> code and needs some commenting ;)
>>
>>
>>>
>>>
>>>> + /* shutdown the irqfd */
>>>> + struct _irqfd *_irqfd = NULL;
>>>> +
>>>> + mutex_lock(&kvm->lock);
>>>> +
>>>> + if (!list_empty(&irqfd->list))
>>>> + _irqfd = irqfd;
>>>> +
>>>> + if (_irqfd)
>>>> + list_del(&_irqfd->list);
>>>> +
>>>> + mutex_unlock(&kvm->lock);
>>>> +
>>>> + /*
>>>> + * If the item is not currently on the irqfds list, we know
>>>> + * we are running concurrently with the KVM side trying to
>>>> + * remove this item as well.
>>>>
>>>>
>>> We do? How? As far as I can see list is only empty after it has been
>>> created. Generally, it would be better to either use a flag or use
>>> list_empty as an indication of going down, but not both.
>>>
>>>
>> I think you are mis-reading that. list_empty(&irqfd->list) is the
>> individual irqfd list-item, not the kvm->irqfds list itself. This
>> conditional is telling us whether the irqfd in question is on or off the
>> list (its effectively an irqfd-specific flag), not whether the global
>> list is empty. Again, poor commenting on my part.
>>
>
> Yes, but you do INIT_LIST_HEAD in a single place. Once you add
> irqfd->list to a list, it won't be empty until you init it again.
>

Good point. I need list_del_init() and then it would work, right?

>
>>>
>>>
>>>> Since the KVM side should be
>>>> + * holding the reference now, and will block behind a
>>>> + * flush_work(), lets just let them do the release() for us
>>>> + */
>>>> + if (!_irqfd)
>>>> + return;
>>>> +
>>>> + irqfd_release(_irqfd);
>>>> + }
>>>> }
>>>>
>>>> static int
>>>> @@ -65,25 +108,20 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>>> unsigned long flags = (unsigned long)key;
>>>>
>>>> /*
>>>> - * Assume we will be called with interrupts disabled
>>>> + * called with interrupts disabled
>>>> */
>>>> - if (flags & POLLIN)
>>>> - /*
>>>> - * Defer the IRQ injection until later since we need to
>>>> - * acquire the kvm->lock to do so.
>>>> - */
>>>> - schedule_work(&irqfd->inject);
>>>> -
>>>> if (flags & POLLHUP) {
>>>> /*
>>>> - * for now, just remove ourselves from the list and let
>>>> - * the rest dangle. We will fix this up later once
>>>> - * the races in eventfd are fixed
>>>> + * ordering is important: shutdown flag must be visible
>>>> + * before we schedule
>>>> */
>>>> __remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>>> - irqfd->wqh = NULL;
>>>> + set_bit(irqfd_flags_shutdown, &irqfd->flags);
>>>>
>>>>
>>> So what happens if a previously scheduled work runs on irqfd
>>> and sees this flag?
>>>
>> My original thought was "thats ok", but now that you mention it I am not
>> so sure. Ill give it some more thought because maybe you are on to
>> something.
>>
>>
>>> And note that multiple works can run on irqfd
>>> in parallel.
>>>
>>>
>> They can? I thought work-queue items were guaranteed to only schedule
>> once? If what you say is true, its broken, I agree, and Ill need to
>> revisit. Let me get back to you.
>>
>>>
>>>
>>>> }
>>>>
>>>> + if (flags & (POLLHUP | POLLIN))
>>>> + schedule_work(&irqfd->work);
>>>> +
>>>> return 0;
>>>> }
>>>>
>>>> @@ -102,6 +140,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>>> {
>>>> struct _irqfd *irqfd;
>>>> struct file *file = NULL;
>>>> + struct kref *kref = NULL;
>>>> int ret;
>>>> unsigned int events;
>>>>
>>>> @@ -112,7 +151,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>>> irqfd->kvm = kvm;
>>>> irqfd->gsi = gsi;
>>>> INIT_LIST_HEAD(&irqfd->list);
>>>> - INIT_WORK(&irqfd->inject, irqfd_inject);
>>>> + INIT_WORK(&irqfd->work, irqfd_work);
>>>>
>>>> file = eventfd_fget(fd);
>>>> if (IS_ERR(file)) {
>>>> @@ -133,11 +172,13 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>>> list_add_tail(&irqfd->list, &kvm->irqfds);
>>>> mutex_unlock(&kvm->lock);
>>>>
>>>> - /*
>>>> - * Check if there was an event already queued
>>>> - */
>>>> - if (events & POLLIN)
>>>> - schedule_work(&irqfd->inject);
>>>> + kref = eventfd_kref_get(file);
>>>> + if (IS_ERR(file)) {
>>>> + ret = PTR_ERR(file);
>>>> + goto fail;
>>>> + }
>>>> +
>>>> + irqfd->eventfd = kref;
>>>>
>>>> /*
>>>> * do not drop the file until the irqfd is fully initialized, otherwise
>>>> @@ -145,9 +186,18 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>>> */
>>>> fput(file);
>>>>
>>>> + /*
>>>> + * Check if there was an event already queued
>>>> + */
>>>>
>>>>
>>> This comment seems to confuse more that it clarifies:
>>> queued where? eventfd only counts... Just kill the comment?
>>>
>>>
>>>
>> non-zero values in eventfd are "queued" as a signal. This test just
>> checks if an interrupt was already injected before we registered.
>>
>
> After have understood the code I see what you mean, but the comment
> wasn't helpful and is better left out.
>

Ok. What if I say "Check if an interrupt is already pending before we
registered the callback" ;)

>
>>>> + if (events & POLLIN)
>>>> + schedule_work(&irqfd->work);
>>>> +
>>>> return 0;
>>>>
>>>> fail:
>>>> + if (kref && !IS_ERR(kref))
>>>> + eventfd_kref_put(kref);
>>>> +
>>>> if (file && !IS_ERR(file))
>>>> fput(file);
>>>>
>>>>
>>> let's add a couple more labels and avoid the kref/file check
>>> and the initialization above?
>>>
>>>
>> I think that just makes it more confusing, personally. But I will give
>> it some thought.
>>
>>
>>>
>>>
>>>>
>>>> @@ -161,21 +211,47 @@ kvm_irqfd_init(struct kvm *kvm)
>>>> INIT_LIST_HEAD(&kvm->irqfds);
>>>> }
>>>>
>>>> +static struct _irqfd *
>>>> +irqfd_pop(struct kvm *kvm)
>>>> +{
>>>> + struct _irqfd *irqfd = NULL;
>>>> +
>>>> + mutex_lock(&kvm->lock);
>>>> +
>>>> + if (!list_empty(&kvm->irqfds)) {
>>>> + irqfd = list_first_entry(&kvm->irqfds, struct _irqfd, list);
>>>> + list_del(&irqfd->list);
>>>> + }
>>>> +
>>>> + mutex_unlock(&kvm->lock);
>>>> +
>>>> + return irqfd;
>>>> +}
>>>> +
>>>> void
>>>> kvm_irqfd_release(struct kvm *kvm)
>>>> {
>>>> - struct _irqfd *irqfd, *tmp;
>>>> + struct _irqfd *irqfd;
>>>>
>>>> - list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
>>>> - if (irqfd->wqh)
>>>> - remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>>> + while ((irqfd = irqfd_pop(kvm))) {
>>>>
>>>> - flush_work(&irqfd->inject);
>>>> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>>>
>>>> - mutex_lock(&kvm->lock);
>>>> - list_del(&irqfd->list);
>>>> - mutex_unlock(&kvm->lock);
>>>> + /*
>>>> + * We guarantee there will be no more notifications after
>>>> + * the remove_wait_queue returns. Now lets make sure we
>>>> + * synchronize behind any outstanding work items before
>>>> + * releasing the resources
>>>> + */
>>>> + flush_work(&irqfd->work);
>>>>
>>>> - kfree(irqfd);
>>>> + irqfd_release(irqfd);
>>>> }
>>>> +
>>>> + /*
>>>> + * We need to wait in case there are any outstanding work-items
>>>> + * in flight that had already removed themselves from the list
>>>> + * prior to entry to this function
>>>> + */
>>>>
>>>>
>>> Looks scary. Why doesn't the flush above cover all cases?
>>>
>>>
>> The path inside the while() is for when KVM wins the race and finds the
>> item in the list. It atomically removes it, and is responsible for
>> freeing it in a coordinated way. In this case, we must block with the
>> flush_work() before we can irqfd_release() so that we do not yank the
>> memory out from under a running work-item.
>>
>> The flush_scheduled_work() is for when eventfd wins the race and has
>> already removed itself from the list in the "shutdown" path in the
>> work-item. We want to make sure that kvm_irqfd_release() cannot return
>> until all work-items have exited to prevent something like the kvm.ko
>> module unloading while the work-item is still in flight.
>> Thanks Michael,
>> -Greg
>>
>>>
>>>
>>>> + flush_scheduled_work();
>>>> }
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at http://www.tux.org/lkml/
>>>
>>>
>>
>
>
>


Attachment: signature.asc
Description: OpenPGP digital signature