Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

From: Jason Gunthorpe
Date: Wed Feb 02 2022 - 13:04:23 EST


On Wed, Feb 02, 2022 at 10:30:41AM -0700, Alex Williamson wrote:
> On Wed, 2 Feb 2022 14:34:52 +0000
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@xxxxxxxxxx> wrote:
>
> > > From: Jason Gunthorpe [mailto:jgg@xxxxxxxxxx]
> >
> > >
> > > I see pf_qm_state_pre_save() but didn't understand why it wanted to
> > > send the first 32 bytes in the PRECOPY mode? It is fine, but it
> > > will add some complexity to continue to do this.
> >
> > That was mainly to do a quick verification between src and dst compatibility
> > before we start saving the state. I think probably we can delay that check
> > for later.
>
> In the v1 migration scheme, this was considered good practice. It
> shouldn't be limited to PRECOPY, as there's no requirement to use
> PRECOPY, but the earlier in the migration process that we can trigger a
> device or data stream compatibility fault, the better. TBH, even in
> the case where a device doesn't support live dirty tracking for a
> PRECOPY phase, using it for compatibility testing continues to seem
> like good practice.

At least with our thinking here, we'd rather the device expose
explicit compatibility data via get/test system calls so we can build
proper infrastructure around this.

Every device will have compatibility requirements and we can build
more shared common code this way, i.e. qemu can ideally fetch the data
before migration starts and do an exchange with the live migration
target to see if it is OK. Orchestration can inventory the systems,
and automation can select live migration targets that can actually
work.
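
To make the get/test idea a bit more concrete, here is a rough
userspace-style sketch. Everything in it is hypothetical: struct
mig_compat, struct mig_dev, mig_compat_get() and mig_compat_test() are
made-up names for illustration, not an existing VFIO uAPI, and the
fields are only guesses at what a device might want to compare.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical compatibility record; NOT an existing VFIO uAPI. */
struct mig_compat {
	uint32_t magic;      /* identifies the record format */
	uint32_t version;    /* migration data stream version */
	uint32_t dev_id;     /* device model */
	uint32_t num_queues; /* resources the stream will describe */
};

#define MIG_COMPAT_MAGIC 0x4d494743u /* arbitrary value for this sketch */

/* Hypothetical per-device facts the driver would consult. */
struct mig_dev {
	uint32_t version;
	uint32_t dev_id;
	uint32_t num_queues;
};

/* "get": the source describes the stream it is going to produce. */
int mig_compat_get(const struct mig_dev *src, struct mig_compat *out)
{
	assert(src && out);

	memset(out, 0, sizeof(*out));
	out->magic = MIG_COMPAT_MAGIC;
	out->version = src->version;
	out->dev_id = src->dev_id;
	out->num_queues = src->num_queues;
	return 0;
}

/* "test": the destination says whether it could accept that stream. */
int mig_compat_test(const struct mig_dev *dst, const struct mig_compat *in)
{
	assert(dst && in);

	if (in->magic != MIG_COMPAT_MAGIC)
		return -EINVAL;     /* not a record we understand */
	if (in->dev_id != dst->dev_id)
		return -ENODEV;     /* different device model */
	if (in->version > dst->version)
		return -EOPNOTSUPP; /* stream is newer than we can parse */
	if (in->num_queues > dst->num_queues)
		return -ENOSPC;     /* not enough resources on this side */
	return 0;
}
```

The point is only that the record is small, self-describing, and
available before any migration state is saved, so qemu or an
orchestrator can carry it to the target and call the test side there.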

If it is hidden inside the migration stream it is too invisible to be
fully useful.

This is something we've been talking about here but don't have much
concrete to say for mlx5 yet.

The device still has to protect itself against a corrupted
migration stream impacting integrity, of course.

IIRC qemu has a nice spot to put this in the existing protocol.

Just overall, now that PRECOPY is optional, we should avoid using it
without a good reason. The driver implementation does have a cost.

Thanks,
Jason