Re: [PATCH] vfio/pci: Support error recovery

From: Michael S. Tsirkin
Date: Tue Dec 13 2016 - 11:13:46 EST


On Mon, Dec 12, 2016 at 08:39:48PM -0700, Alex Williamson wrote:
> On Tue, 13 Dec 2016 05:15:13 +0200
> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
>
> > On Mon, Dec 12, 2016 at 03:43:13PM -0700, Alex Williamson wrote:
> > > > So just don't do it then. Topology must match between host and guest,
> > > > except maybe for the case of devices with host driver (e.g. PF)
> > > > which we might be able to synchronize against.
> > >
> > > We're talking about host kernel level handling here. The host kernel
> > > cannot defer the link reset to the user under the assumption that the
> > > user is handling the devices in a very specific way. The moment we do
> > > that, we've lost.
> >
> > The way is the same as on bare metal though, so why not?
>
> How do we know this? What if the user is dpdk? The kernel is
> responsible for maintaining the integrity of the system and devices,
> not the user.
>
> > And if the user doesn't do what's expected, we can
> > do the full link reset on close.
>
> That's exactly my point, if we're talking about multiple devices,
> there's no guarantee that the close() for each is simultaneous. If one
> function is released before the other we cannot do a bus reset. If
> that device is then opened by another user before its sibling is
> released, then we once again cannot perform a link reset. I don't
> think it would be reasonable to mark the released device quarantined
> until the sibling is released, that would be a terrible user experience.

Not sure why you find it so terrible, and I don't think there's another way.
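
To make the quarantine option concrete, here is a rough userspace model of
the bookkeeping being discussed. It is only a sketch under stated
assumptions, not vfio code: struct slot, fn_open(), fn_release() and
slot_report_error() are hypothetical names invented for illustration, and
the "link reset" is just a counter. The idea is that once an error leaves a
reset owed, a released function stays quarantined (open fails with -EBUSY)
until its last sibling is released, at which point the reset can finally be
done.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FUNCS 8

/* Hypothetical per-function state; not taken from the vfio sources. */
enum fn_state { FN_FREE, FN_IN_USE, FN_QUARANTINED };

/* Hypothetical model of all functions that share one link/bus. */
struct slot {
	enum fn_state fn[MAX_FUNCS];
	size_t nfuncs;
	bool reset_pending;    /* an error occurred, link reset still owed */
	unsigned resets_done;  /* counts simulated link resets */
};

static bool slot_any_in_use(const struct slot *s)
{
	for (size_t i = 0; i < s->nfuncs; i++)
		if (s->fn[i] == FN_IN_USE)
			return true;
	return false;
}

/* Record that an error happened and a link reset is now owed. */
static void slot_report_error(struct slot *s)
{
	s->reset_pending = true;
}

/* Open fails while a reset is owed: the function is not safe to hand out. */
static int fn_open(struct slot *s, size_t idx)
{
	assert(idx < s->nfuncs);
	if (s->reset_pending || s->fn[idx] != FN_FREE)
		return -EBUSY;
	s->fn[idx] = FN_IN_USE;
	return 0;
}

/*
 * Release one function. With no reset owed it simply becomes free again.
 * Otherwise it is quarantined, and only when the last sibling has been
 * released is the (simulated) link reset done and every function freed.
 */
static void fn_release(struct slot *s, size_t idx)
{
	assert(idx < s->nfuncs);
	assert(s->fn[idx] == FN_IN_USE);

	if (!s->reset_pending) {
		s->fn[idx] = FN_FREE;
		return;
	}

	s->fn[idx] = FN_QUARANTINED;
	if (slot_any_in_use(s))
		return;  /* a sibling is still open: cannot reset yet */

	s->resets_done++;  /* stand-in for the real link reset */
	s->reset_pending = false;
	for (size_t i = 0; i < s->nfuncs; i++)
		s->fn[i] = FN_FREE;
}
```

In this model, with two functions open and an error reported, releasing
function 0 leaves it quarantined and a reopen returns -EBUSY. Releasing
function 1 then triggers the single reset and both become available again.
The cost Alex objects to is visible here too: function 0 is unusable for as
long as the other user keeps function 1 open.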

--
MST