Re: [Qemu-devel] [PATCH v2 2/2] vfio : add aer process

From: Zhou Jie
Date: Mon Aug 01 2016 - 21:23:58 EST


Hi, Alex

Clearly this has only been tested for a single instance of an AER error
event and resume per device. Are the things you're intending to block
actually blocked for subsequent events? Note how complete_all() fills
the done field to let all current and future waiters go through and
nowhere is there a call to reinit_completion() to drain that path.
Thanks,

Alex

Do you mean this condition?

For device 1:
error1 occurs ---- error1 resumes
error2 occurs ---- error2 resumes
error3 occurs ---- error3 resumes

In current code, I do complete_all() when error1 resumes.
And this will unblock the device
when error2 and error3 are still be processed.

So walk me through how this works. On vfio_pci_open() we call
init_completion(), which sets aer_error_completion.done equal to zero
(BTW, a user can open the device file descriptor multiple times, so
there's already a bug here).
I will call init_completion() in vfio_pci_probe.

Let's assume that an error occurs and the
user stalls a single access on wait_for_completion_interruptible().
The bulk of this function happens here:

static inline long __sched
do_wait_for_common(struct completion *x,
long (*action)(long), long timeout, int state)
{
if (!x->done) {
DECLARE_WAITQUEUE(wait, current);

__add_wait_queue_tail_exclusive(&x->wait, &wait);
do {
if (signal_pending_state(state, current)) {
timeout = -ERESTARTSYS;
break;
}
__set_current_state(state);
spin_unlock_irq(&x->wait.lock);
timeout = action(timeout);
spin_lock_irq(&x->wait.lock);
} while (!x->done && timeout);
__remove_wait_queue(&x->wait, &wait);
if (!x->done)
return timeout;
}
x->done--;
return timeout ?: 1;
}

So it waits within that do{}while loop for a completion, interruption,
or timeout. Then we have:

void complete_all(struct completion *x)
{
unsigned long flags;

spin_lock_irqsave(&x->wait.lock, flags);
x->done += UINT_MAX/2;
__wake_up_locked(&x->wait, TASK_NORMAL, 0);
spin_unlock_irqrestore(&x->wait.lock, flags);
}

So aer_error_completion.done gets incremented to let a couple billion
completion waiters through... Show me how another call to
wait_for_completion_interruptible() will ever block again within our
lifetime when the actual wait of do_wait_for_common() is only entered
when 'done' count is equal to zero. This seems to be why
reinit_completion() exists, but it's not used here. Thanks,

Alex

I will call reinit_completion() in vfio_pci_aer_err_detected when
an aer error is detected.
Thank you very much.

Sincerely
ZhouJie