Re: [PATCH] vfio/pci: Support error recovery

From: Cao jin
Date: Thu Dec 15 2016 - 08:52:24 EST




On 12/15/2016 06:16 AM, Alex Williamson wrote:
> On Wed, 14 Dec 2016 18:24:23 +0800
> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>
>> Sorry for late.
>> after reading all your comments, I think I will try the solution 1.
>>
>> On 12/13/2016 03:12 AM, Alex Williamson wrote:
>>> On Mon, 12 Dec 2016 21:49:01 +0800
>>> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>>>
>>>> Hi,
>>>> I have 2 solutions(high level design) came to me, please see if they are
>>>> acceptable, or which one is acceptable. Also have some questions.
>>>>
>>>> 1. block guest access during host recovery
>>>>
>>>> add new field error_recovering in struct vfio_pci_device to
>>>> indicate host recovery status. aer driver in host will still do
>>>> reset link
>>>>
>>>> - set error_recovering in vfio-pci driver's error_detected, used to
>>>> block all kinds of user access(config space, mmio)
>>>> - in order to solve concurrent issue of device resetting & user
>>>> access, check device state[*] in vfio-pci driver's resume, see if
>>>> device reset is done, if it is, then clear"error_recovering", or
>>>> else new a timer, check device state periodically until device
>>>> reset is done. (what if device reset don't end for a long time?)
>>>> - In qemu, translate guest link reset to host link reset.
>>>> A question here: we already have link reset in host, is a second
>>>> link reset necessary? why?
>>>>
>>>> [*] how to check device state: reading certain config space
>>>> register, check return value is valid or not(All F's)
>>>
>>> Isn't this exactly the path we were on previously?
>>
>> Yes, it is basically the previous path, plus the optimization.
>>
>>> There might be an
>>> optimization that we could skip back-to-back resets, but how can you
>>> necessarily infer that the resets are for the same thing? If the user
>>> accesses the device between resets, can you still guarantee the guest
>>> directed reset is unnecessary? If time passes between resets, do you
>>> know they're for the same event? How much time can pass between the
>>> host and guest reset to know they're for the same event? In the
>>> process of error handling, which is more important, speed or
>>> correctness?
>>>
>>
>> I think vfio driver itself won't know what each reset comes for, and I
>> don't quite understand why should vfio care this question, is this a new
>> question in the design?
>
> You're suggesting an optimization to eliminate one of the resets,
> and as we've discussed, I don't see removing the host induced reset
> as a viable option. That means you want to eliminate the guest
> directed reset. There are potentially three levels to do that, the
> vfio-pci driver in the host kernel, somewhere in QEMU, or eliminate it
> within the guest. My comments were directed to the first option, the
> host kernel level cannot correlate user directed resets as duplicates
> of host directed resets.
>

Ah, maybe it is mistake, I don't really want to eliminate guest directed
reset very much, I was just not sure why it is very necessary.

The optimization I said just is fully separating host recovery from
guest recovery(timer, check device periodically) in time, because there
is concurrent device resetting & user access.

>> But I think it make sense that the user access during 2 resets maybe a
>> trouble for guest recovery, misbehaved user could be out of our
>> imagination. Correctness is more important.
>>
>> If I understand you right, let me make a summary: host recovery just
>> does link reset, which is incomplete, so we'd better do a complete guest
>> recovery for correctness.
>
> We don't know whether the host link reset is incomplete, but we can't do
> a link reset transparently to the device, the device is no longer in the
> same state after the reset. The device specific driver, which exists
> in userspace needs to be involved in device recovery. Therefore
> regardless of how QEMU handles the error, the driver within the guest
> needs to be notified and perform recovery. Since the device is PCI and
> we're on x86 and nobody wants to introduce paravirtual error recovery,
> we must use AER. Part of AER recovery includes the possibility of
> performing a link reset. So it seems this eliminates avoiding the link
> reset within the guest.
>
> That leaves QEMU. Here we need to decide whether a guest triggered
> link reset induces a host link reset. The current working theory is
> that yes, this must be the case. If there is ever a case where a
> driver within the guest could trigger a link reset for the purposes
> of error recovery when the host has not, I think this must be the
> case. Therefore, at least some guest induced link resets must become
> host link resets. Currently we assume all guest induced link resets
> become host link resets. Minimally to avoid that, QEMU would need to
> know (not assume) whether the host performed a link reset. Even with
> that, QEMU would need to be able to correlate that a link reset from
> the guest is a duplicate of a link reset that was already performed by
> the host. That implies that QEMU needs to deduce the intention of
> the guest. That seems like a complicated task for a patch series that
> is already complicated enough, especially for a feature of questionable
> value given the configuration restrictions (imo).
>
> I would much rather focus on getting it right and making it as simple
> as we can, even if that means links get reset one too many times on
> error.
>

Thanks very much for your detailed explanation, it does helps me to
understand your concern, understand why a second link reset is necessary.

I still want to share my thoughts with you(not argue): now we know host
aer driver will do link reset for vfio-pci first, so I can say, even if
fatal error is link related, after host link reset, link can work now.
Then in qemu, we are not necessary to translate guest link reset to host
link reset, just use vfio_pci_reset() as it is to do device
reset(probably is FLR). Which also means we don't need following
patch(make code easier):

@@ -3120,6 +3122,18 @@ static void vfio_pci_reset(DeviceState *dev)

trace_vfio_pci_reset(vdev->vbasedev.name);

+ if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
+ PCIDevice *br = pci_bridge_get_device(pdev->bus);
+
+ if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
+ PCI_BRIDGE_CTL_BUS_RESET)) {
+ if (pci_get_function_0(pdev) == pdev) {
+ vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
+ }
+ return;
+ }
+ }
+
vfio_pci_pre_reset(vdev);


I think this also implies: we have a virtual link in qemu, but a virtual
link will never be broken like a physical link.(In particular we already
know host aer driver surely will do link reset to recover physical
link). So, guest's link reset don't need to care whether virtual link is
reset, just care virtual device. And qemu "translates guest link reset
to host link reset" seems kind of taking link-reset responsibility over
from host:)

>>>> 2. skip link reset in aer driver of host kernel, for vfio-pci.
>>>> Let user decide how to do serious recovery
>>>>
>>>> add new field "user_driver" in struct pci_dev, used to skip link
>>>> reset for vfio-pci; add new field "link_reset" in struct
>>>> vfio_pci_device to indicate link has been reset or not during
>>>> recovery
>>>>
>>>> - set user_driver in vfio_pci_probe(), to skip link reset for
>>>> vfio-pci in host.
>>>> - (use a flag)block user access(config, mmio) during host recovery
>>>> (not sure if this step is necessary)
>>>> - In qemu, translate guest link reset to host link reset.
>>>> - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET
>>>> is executed
>>>> - In vfio-pci driver's resume, new a timer, check "link_reset" field
>>>> periodically, if it is set in reasonable time, then clear it and
>>>> delete timer, or else, vfio-pci driver will does the link reset!
>>>
>>> What happens in the case of a multifunction device where each function
>>> is part of a separate IOMMU group and one function is hot-removed from
>>> the user? We can't do a link reset on that function since the other
>>> function is still in use. We have no choice but release a device in an
>>> unknown state back to the host.
>>
>> hot-remove from user, do you mean, for example, all functions assigned
>> to VM, then suddenly a person does something like following
>>
>> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
>>
>> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind
>>
>> to return device to host driver, or don't bind it to host driver, let it
>> in driver-less state???
>
> Yes, the host kernel has no visiblity to how a user is making use of
> devices. To support AER we require a similar topology between host and
> guest such that a guest link reset translates to a host reset. That
> requirement is imposed by userspace, ie. QEMU. The host kernel cannot
> presume that this is the case. Therefore we could have a
> multi-function device where each function is assigned to the same or
> different users in any configuration. If a fault occurs and we defer
> to the user to perform the link reset, we have absolutely no guarantee
> that it will ever occur. If the functions are assigned to different
> users, then each user individually doesn't have the capability to
> perform a link reset. If the devices happen to be assigned to a single
> user when the error occurs, we cannot assume the user has an AER
> compatible configuration, the devices could be exposed as separate
> single function devices, any one of which might be individually removed
> from the user and made use of by the host, such as your sysfs example
> above. The host cannot perform a link reset in this case either
> as the sibling devices are still in use by the guest. Thanks,
>
> Alex
>
>

this explanation is valuable to me, so this is also why we can't do link
reset in vfio driver when one of the function is closed. And do link
reset in vfio driver until all functions are close is poor solution and
very complex(quarantine the device) as you said.

I am going to try solution 1, but I still have some consideration share
with you, this won't stop my trial, and don't have relationship with
above discussion, just FYI:

In non-virtuallization environment, from device's perspective, the steps
of a normal recovery consists of:
error_detect
mmio_enabled
link_reset
slot_reset
resume

Now in our condition, the steps become:
*link_reset* (host's, the following are guest's)
error_detect
mmio_enabled
link_reset
slot_reset
resume

Especially, some device's specific driver in guest could do some
specific work in error_detect, take igb_io_error_detected() for example.
Like the words in pci-error-recovery.txt said:

it gives the driver a chance to cleanup, waiting for pending stuff
(timers, whatever, etc...) to complete;

But if link_reset is the first step, we lost all the status(register
value, etc) in the device. Of course I don't know if this will be a
problem (might not), just curious if this has been your concern:)

--
Sincerely,
Cao jin