Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

From: Max Gurtovoy
Date: Wed Dec 07 2022 - 05:59:42 EST



On 12/7/2022 9:54 AM, Christoph Hellwig wrote:
On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
What the kernel is doing is providing the abstraction to link the
controlling function to the VFIO device in a general way.

We don't want to just punt this problem to user space and say 'good
luck finding the right cdev for migration control'. If the kernel
struggles to link them then userspace will not fare better on its own.
Yes. But the right interface for that is to issue the userspace
commands for anything that is not normal PCIe function level
to the controlling function, and to discover the controlled functions
based on the controlling functions.

In other words: there should be absolutely no need to have any
special kernel support for the controlled function. Instead the
controlling function enumerates all the functions it controls, exports
that to userspace and exposes the functionality to save state from
and restore state to the controlled functions.

Why is it preferred that the migration SW talk directly to the PF rather than go through the VFIO interface?

It's just an implementation detail.

To me it sounds even more reasonable to have a common API, like the one we have today (save_state/resume_state/quiesce_device/freeze_device), and have each device implementation translate this functionality to its own spec.
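To illustrate what I mean by a common API, here is a minimal userspace sketch. It is not the VFIO uAPI or any existing kernel interface: struct lm_ops, struct lm_device, the nvme_* callbacks and the lm_* helpers are all hypothetical names, and the NVMe backend is a stub that only copies a byte array where a real one would issue spec-defined commands.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct lm_device;

/* Hypothetical device-agnostic ops, named after the operations in the text. */
struct lm_ops {
    int (*quiesce_device)(struct lm_device *dev);
    int (*freeze_device)(struct lm_device *dev);
    int (*save_state)(struct lm_device *dev, void *buf, size_t len);
    int (*resume_state)(struct lm_device *dev, const void *buf, size_t len);
};

struct lm_device {
    const char *name;
    const struct lm_ops *ops;
    int quiesced;
    int frozen;
    unsigned char state[16];   /* stand-in for opaque device state */
};

/* Stub NVMe backend: a real one would translate each op to its own spec. */
static int nvme_quiesce(struct lm_device *dev)
{
    dev->quiesced = 1;
    return 0;
}

static int nvme_freeze(struct lm_device *dev)
{
    if (!dev->quiesced)
        return -1;             /* freeze is only valid after quiesce */
    dev->frozen = 1;
    return 0;
}

static int nvme_save(struct lm_device *dev, void *buf, size_t len)
{
    if (!dev->frozen || len < sizeof(dev->state))
        return -1;
    memcpy(buf, dev->state, sizeof(dev->state));
    return 0;
}

static int nvme_resume(struct lm_device *dev, const void *buf, size_t len)
{
    if (len < sizeof(dev->state))
        return -1;
    memcpy(dev->state, buf, sizeof(dev->state));
    dev->quiesced = 0;
    dev->frozen = 0;
    return 0;
}

const struct lm_ops nvme_lm_ops = {
    .quiesce_device = nvme_quiesce,
    .freeze_device  = nvme_freeze,
    .save_state     = nvme_save,
    .resume_state   = nvme_resume,
};

/* Generic migration code: it sees only the ops table, never the backend. */
int lm_stop_and_copy(struct lm_device *dev, void *buf, size_t len)
{
    assert(dev && dev->ops);
    int ret = dev->ops->quiesce_device(dev);
    if (ret)
        return ret;
    ret = dev->ops->freeze_device(dev);
    if (ret)
        return ret;
    return dev->ops->save_state(dev, buf, len);
}

int lm_resume(struct lm_device *dev, const void *buf, size_t len)
{
    assert(dev && dev->ops);
    return dev->ops->resume_state(dev, buf, len);
}
```

The point of the sketch is that lm_stop_and_copy() and lm_resume() would not change if an mlx5 or virtio backend were plugged in instead; only the ops table would.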

If I understand correctly, your direction is to have QEMU code talk to nvmecli/new_mlx5cli/my_device_cli to do that, and I'm not sure that's needed.

The controlled device is not aware of any of the migration process. Only the migration SW, the system admin and the controlling device are.

I see 2 orthogonal discussions here: NVMe standardization for LM and Linux implementation for LM.

For the NVMe standardization: I think we all agree, at a high level, that the primary controller manages the LM of the secondary controllers. The primary controller can list the secondary controllers. The primary controller exposes APIs, using its admin_queue, to manage the LM process of its secondary controllers. LM capabilities will be exposed using the identify_ctrl admin cmd of the primary controller.

For the Linux implementation: the direction we started last year is to have vendor-specific (mlx5/hisi/..) or protocol-specific (nvme/virtio/..) vfio drivers. We built an infrastructure to do that by dividing the vfio_pci driver into vfio_pci and vfio_pci_core, and updated the uAPIs as well to support the P2P case for live migration. Dirty page tracking is also progressing. More work is still to be done to improve this infrastructure, for sure.
I hope that all the above efforts are also going to be used with the NVMe LM implementation, unless there is something NVMe specific in the way of migrating PCI functions that I can't see now.
If there is something NVMe specific for LM, then the migration SW and QEMU will need to be aware of it, and with that awareness we lose the benefit of the generic VFIO interface.


Especially, we do not want every VFIO device to have its own crazy way
for userspace to link the controlling/controlled functions
together. This is something the kernel has to abstract away.
Yes. But the direction must go controlling to controlled, not the
other way around.

So in the source:

1. We enable SRIOV on the NVMe driver

2. We list all the secondary controllers: nvme1, nvme2, nvme3

3. We allow migrating nvme1, nvme2, nvme3 - now these VFs are migratable (controlling to controlled).

4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver

5. We pass these functions to VM

6. We start migration process.


And in the destination:

1. We enable SRIOV on the NVMe driver

2. We list all the secondary controllers: nvme1, nvme2, nvme3

3. We allow migration resume to nvme1, nvme2, nvme3 - now these VFs are resumable (controlling to controlled).

4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver

5. We pass these functions to VM

6. We start migration resume process.
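
The ordering in both lists can be written down as a toy model. This is only a sketch under my own assumptions, not nvme or vfio driver code: struct primary_ctrl, struct secondary_ctrl, enum vf_step and the pf_*/vf_* helpers are hypothetical names. It shows that the secondary controllers only become migratable (or resumable) through the primary controller, and that binding to VFIO and passing to the VM come after that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_VFS 3

/* Steps 2-6 from the lists above, as seen by one secondary controller. */
enum vf_step {
    VF_LISTED,       /* step 2: listed by the primary controller */
    VF_MIGRATABLE,   /* step 3: migration/resume allowed by the primary */
    VF_BOUND_VFIO,   /* step 4: bound to the VFIO NVMe driver */
    VF_ASSIGNED,     /* step 5: passed to the VM */
    VF_MIGRATING,    /* step 6: migration (or resume) started */
};

struct secondary_ctrl {
    const char *name;
    enum vf_step step;
};

struct primary_ctrl {
    bool sriov_enabled;
    struct secondary_ctrl vfs[NR_VFS];
};

/* Steps 1 and 2: enable SR-IOV and populate the secondary controllers. */
int pf_enable_sriov(struct primary_ctrl *pf)
{
    static const char *names[NR_VFS] = { "nvme1", "nvme2", "nvme3" };

    assert(pf != NULL);
    for (size_t i = 0; i < NR_VFS; i++) {
        pf->vfs[i].name = names[i];
        pf->vfs[i].step = VF_LISTED;
    }
    pf->sriov_enabled = true;
    return 0;
}

/* Step 3: only the primary controller can make a VF migratable. */
int pf_allow_migration(struct primary_ctrl *pf, size_t idx)
{
    if (!pf->sriov_enabled || idx >= NR_VFS)
        return -1;
    if (pf->vfs[idx].step != VF_LISTED)
        return -1;
    pf->vfs[idx].step = VF_MIGRATABLE;
    return 0;
}

/* Steps 4-6: advance one step at a time, and only after step 3. */
int vf_advance(struct secondary_ctrl *vf, enum vf_step next)
{
    if (vf->step < VF_MIGRATABLE || next > VF_MIGRATING)
        return -1;
    if ((int)next != (int)vf->step + 1)
        return -1;
    vf->step = next;
    return 0;
}
```

In this model, binding nvme1 to VFIO before the primary controller has allowed migration fails, which is the controlling-to-controlled direction.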