Re: [PATCH] vfio/pci: Support error recovery

From: Alex Williamson
Date: Mon Dec 12 2016 - 14:12:23 EST


On Mon, 12 Dec 2016 21:49:01 +0800
Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:

> Hi,
> Two solutions (high-level designs) came to me; please see if they are
> acceptable, or which one is acceptable. I also have some questions.
>
> 1. block guest access during host recovery
>
>    add a new field error_recovering in struct vfio_pci_device to
>    indicate host recovery status. The aer driver in the host will still
>    do the link reset.
>
>    - set error_recovering in vfio-pci driver's error_detected, used to
>      block all kinds of user access (config space, mmio)
>    - in order to solve the concurrency issue of device resetting & user
>      access, check device state[*] in vfio-pci driver's resume, see if
>      device reset is done; if it is, then clear "error_recovering", or
>      else start a timer and check device state periodically until device
>      reset is done. (what if device reset doesn't end for a long time?)
>    - In qemu, translate guest link reset to host link reset.
>      A question here: we already have link reset in host, is a second
>      link reset necessary? why?
>
>    [*] how to check device state: read a certain config space
>        register and check whether the return value is valid (all F's)
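
For reference, the [*] check boils down to something like the userspace
sketch below. It reads the vendor ID through the sysfs config file rather
than from inside the driver, and the helper names are made up for
illustration; the only hard fact it relies on is that 0xFFFF is never a
valid vendor ID, so an all-F's read normally means the function did not
answer.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* 0xFFFF is never a valid vendor ID; all 1s usually means no response. */
static bool vendor_id_is_valid(uint16_t vendor)
{
	return vendor != 0xFFFF;
}

/*
 * Read the 16-bit vendor ID (config offset 0, little-endian) from a
 * sysfs config file, e.g. /sys/bus/pci/devices/<BDF>/config.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
static int read_vendor_id(const char *config_path, uint16_t *vendor)
{
	uint8_t raw[2];
	ssize_t n;
	int fd;

	assert(config_path && vendor);

	fd = open(config_path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = pread(fd, raw, sizeof(raw), 0);
	close(fd);
	if (n != (ssize_t)sizeof(raw))
		return -1;

	*vendor = (uint16_t)(raw[0] | (raw[1] << 8));
	return 0;
}

/* Hypothetical helper: true only if the read worked and was not all F's. */
static bool device_responds(const char *config_path)
{
	uint16_t vendor;

	return read_vendor_id(config_path, &vendor) == 0 &&
	       vendor_id_is_valid(vendor);
}
```

A valid read only says the function responds again, not that recovery
left it in a known state.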

Isn't this exactly the path we were on previously? There might be an
optimization that we could skip back-to-back resets, but how can you
necessarily infer that the resets are for the same thing? If the user
accesses the device between resets, can you still guarantee the guest
directed reset is unnecessary? If time passes between resets, do you
know they're for the same event? How much time can pass between the
host and guest reset to know they're for the same event? In the
process of error handling, which is more important, speed or
correctness?
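
To make those questions concrete, here is a rough sketch of what a
"skip the second reset" optimization would have to track. Everything in
it is hypothetical: struct reset_history, its fields, and in particular
SAME_EVENT_WINDOW_MS are invented for illustration and exist nowhere in
vfio-pci.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping; none of these fields exist in vfio-pci. */
struct reset_history {
	bool     host_reset_done;      /* host AER recovery reset the link */
	uint64_t host_reset_ms;        /* when that reset completed */
	bool     accessed_since_reset; /* user touched config/MMIO afterwards */
};

/* Arbitrary assumption: nothing tells us what this value should be. */
#define SAME_EVENT_WINDOW_MS 1000u

/*
 * Returns true only if the guest-directed reset could plausibly be
 * treated as a duplicate of the host reset.
 */
static bool guest_reset_can_be_skipped(const struct reset_history *h,
				       uint64_t now_ms)
{
	if (!h->host_reset_done)
		return false;	/* nothing to be a duplicate of */
	if (h->accessed_since_reset)
		return false;	/* device state may have changed again */
	if (now_ms < h->host_reset_ms)
		return false;	/* clock went backwards, don't guess */
	return now_ms - h->host_reset_ms <= SAME_EVENT_WINDOW_MS;
}
```

Every early return errs toward doing the reset anyway, and the one case
that skips it hinges on a window picked out of thin air.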

> 2. skip link reset in aer driver of host kernel, for vfio-pci.
> Let user decide how to do serious recovery
>
>    add a new field "user_driver" in struct pci_dev, used to skip link
>    reset for vfio-pci; add a new field "link_reset" in struct
>    vfio_pci_device to indicate whether the link has been reset during
>    recovery
>
>    - set user_driver in vfio_pci_probe(), to skip link reset for
>      vfio-pci in host.
>    - (use a flag) block user access (config, mmio) during host recovery
>      (not sure if this step is necessary)
>    - In qemu, translate guest link reset to host link reset.
>    - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET
>      is executed
>    - In vfio-pci driver's resume, start a timer and check the "link_reset"
>      field periodically; if it is set in reasonable time, then clear it
>      and delete the timer, or else vfio-pci driver will do the link reset!

What happens in the case of a multifunction device where each function
is part of a separate IOMMU group and one function is hot-removed from
the user? We can't do a link reset on that function since the other
function is still in use. We have no choice but to release a device in
an unknown state back to the host. As previously discussed, we don't
expect that any sort of function-level reset (FLR) will necessarily
reset the device to the same state. I also don't really like vfio-pci
taking
over error handling capabilities from the PCI-core. That's redundant
code and extra maintenance overhead.
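
The constraint in the multifunction case can be written down as a small
sketch. The struct and function names here are hypothetical, not kernel
code; the rule they encode is just the one above: a link reset hits
every function behind the link, so it is only safe when no other
function is still in use.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one function behind a shared link. */
struct pci_function_state {
	int  iommu_group;	/* functions may sit in different groups */
	bool in_use;		/* still assigned to some user */
};

/*
 * Returns true if a link reset may be used while releasing function
 * 'released', i.e. no sibling function is still in use.
 */
static bool link_reset_safe_on_release(const struct pci_function_state *funcs,
				       size_t count, size_t released)
{
	if (released >= count)
		return false;

	for (size_t i = 0; i < count; i++) {
		if (i != released && funcs[i].in_use)
			return false;	/* reset would hit a live function */
	}
	return true;
}
```

With function 0 released and function 1 still assigned, this returns
false, which is exactly the case where the device goes back to the host
in an unknown state.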

> A quick question:
> I don't know how devices are divided into iommu groups, is it possible
> for functions in a multi-function device to be split into different groups?

Yes, if a multifunction device supports ACS or if we have quirks to
expose that the functions do not perform internal peer-to-peer, then
they may be in separate IOMMU groups, depending on the rest of the PCI
topology. See:

http://vfio.blogspot.com/2014/08/iommu-groups-inside-and-out.html
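
If you want to check a particular device, each PCI function in sysfs has
an iommu_group symlink whose last path component is the group number. A
minimal sketch of comparing two functions that way follows; the helper
names are made up, and the sample link target in the comment is only an
example of the usual shape.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Parse the group number from a symlink target such as
 * "../../../../kernel/iommu_groups/15". Returns -1 if the last path
 * component is not a non-negative decimal number.
 */
static long iommu_group_from_link(const char *target)
{
	const char *base = strrchr(target, '/');
	char *end;
	long group;

	base = base ? base + 1 : target;
	if (*base == '\0')
		return -1;

	group = strtol(base, &end, 10);
	if (*end != '\0' || group < 0)
		return -1;
	return group;
}

/*
 * Look up the IOMMU group of a function given as "DDDD:BB:DD.F".
 * Returns -1 if the function has no iommu_group link.
 */
static long iommu_group_of(const char *bdf)
{
	char path[512], target[512];
	ssize_t n;
	int len;

	len = snprintf(path, sizeof(path),
		       "/sys/bus/pci/devices/%s/iommu_group", bdf);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -1;

	n = readlink(path, target, sizeof(target) - 1);
	if (n < 0)
		return -1;
	target[n] = '\0';

	return iommu_group_from_link(target);
}
```

Two functions of the same device are split across groups when
iommu_group_of() returns different non-negative values for them.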

Thanks,
Alex