[RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

From: Lei Rao
Date: Tue Dec 06 2022 - 00:59:46 EST


The documentation describes the details of the NVMe hardware
extension to support VFIO live migration.

Signed-off-by: Lei Rao <lei.rao@xxxxxxxxx>
Signed-off-by: Yadong Li <yadong.li@xxxxxxxxx>
Signed-off-by: Chaitanya Kulkarni <kch@xxxxxxxxxx>
Reviewed-by: Eddie Dong <eddie.dong@xxxxxxxxx>
Reviewed-by: Hang Yuan <hang.yuan@xxxxxxxxx>
---
drivers/vfio/pci/nvme/nvme.txt | 278 +++++++++++++++++++++++++++++++++
1 file changed, 278 insertions(+)
create mode 100644 drivers/vfio/pci/nvme/nvme.txt

diff --git a/drivers/vfio/pci/nvme/nvme.txt b/drivers/vfio/pci/nvme/nvme.txt
new file mode 100644
index 000000000000..eadcf2082eed
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.txt
@@ -0,0 +1,278 @@
+===========================
+NVMe Live Migration Support
+===========================
+
+Introduction
+------------
+To support live migration, NVMe device designs its own implementation,
+including five new specific admin commands and a capability flag in
+the vendor-specific field in the identify controller data structure to
+support VF's live migration usage. Software can use these live migration
+admin commands to get device migration state data size, save and load the
+data, suspend and resume the given VF device. They are submitted by software
+to the NVMe PF device's admin queue and ignored if placed in the VF device's
+admin queue. This is due to the NVMe VF device being passed to the virtual
+machine in the virtualization scenario. So VF device's admin queue is not
+available for the hypervisor to submit VF device live migration commands.
+The capability flag in the identify controller data structure can be used by
+software to detect if the NVMe device supports live migration. The following
+chapters introduce the detailed format of the commands and the capability flag.
+
+Definition of opcode for live migration commands
+------------------------------------------------
+
++---------------------------+-----------+-----------+------------+
+| | | | |
+| Opcode by Field | | | |
+| | | | |
++--------+---------+--------+ | | |
+| | | | Combined | Namespace | |
+| 07 | 06:02 | 01:00 | Opcode | Identifier| Command |
+| | | | | used | |
++--------+---------+--------+ | | |
+|Generic | Function| Data | | | |
+|command | |Transfer| | | |
++--------+---------+--------+-----------+-----------+------------+
+| |
+| Vendor SpecificOpcode |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Query the |
+| 1b | 10001 | 00 | 0xC4 | | data size |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Suspend the|
+| 1b | 10010 | 00 | 0xC8 | | VF |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Resume the |
+| 1b | 10011 | 00 | 0xCC | | VF |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Save the |
+| 1b | 10100 | 10 | 0xD2 | |device data |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Load the |
+| 1b | 10101 | 01 | 0xD5 | |device data |
++--------+---------+--------+-----------+-----------+------------+
+
+Definition of QUERY_DATA_SIZE command
+-------------------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xC4 to indicate a qeury command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for more details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 39:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller internal data size to query |
++---------+------------------------------------------------------------------------------------+
+| 63:42 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The QUERY_DATA_SIZE command is used to query the NVMe VF internal data size for live migration.
+When the NVMe firmware receives the command, it will return the size of NVMe VF internal
+data. The data size depends on how many IO queues are created.
+
+Definition of SUSPEND command
+-----------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xC8 to indicate a suspend command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe specification for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 39:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller to suspend |
++---------+------------------------------------------------------------------------------------+
+| 63:42 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The SUSPEND command is used to suspend the NVMe VF controller. When the NVMe firmware receives
+this command, it will suspend the NVMe VF controller.
+
+Definition of RESUME command
+----------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xCC to indicate a resume command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 39:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller to resume |
++---------+------------------------------------------------------------------------------------+
+| 63:42 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The RESUME command is used to resume the NVMe VF controller. When firmware receives this command,
+it will restart the NVMe VF controller.
+
+Definition of SAVE_DATA command
+--------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xD2 to indicate a save command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 23:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 31:24 | PRP Entry1:the first PRP entry for the commmand or a PRP List Pointer |
++---------+------------------------------------------------------------------------------------+
+| 39:32 | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)|
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller internal data to save |
++---------+------------------------------------------------------------------------------------+
+| 63:42 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The SAVE_DATA command is used to save the NVMe VF internal data for live migration. When firmware
+receives this command, it will save the admin queue states, save some registers, drain IO SQs
+and CQs, save every IO queue state, disable the VF controller, and transfer all data to the
+host memory through DMA.
+
+Definition of LOAD_DATA command
+--------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xD5 to indicate a load command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 23:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 31:24 | PRP Entry1:the first PRP entry for the commmand or a PRP List Pointer |
++---------+------------------------------------------------------------------------------------+
+| 39:32 | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)|
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller internal data to load |
++---------+------------------------------------------------------------------------------------+
+| 47:44 | Size: means the size of the device's internal data to be loaded |
++---------+------------------------------------------------------------------------------------+
+| 63:48 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The LOAD_DATA command is used to restore the NVMe VF internal data. When firmware receives this
+command, it will read the device internal's data from the host memory through DMA, restore the
+admin queue states and some registers, and restore every IO queue state.
+
+Extensions of the vendor-specific field in the identify controller data structure
+---------------------------------------------------------------------------------
+
++---------+------+------+------+-------------------------------+
+| | | | | |
+| Bytes | I/O |Admin | Disc | Description |
+| | | | | |
++---------+------+------+------+-------------------------------+
+| | | | | |
+| 01:00 | M | M | R | PCI Vendor ID(VID) |
++---------+------+------+------+-------------------------------+
+| | | | | |
+| 03:02 | M | M | R | PCI Subsytem Vendor ID(SSVID) |
++---------+------+------+------+-------------------------------+
+| | | | | |
+| ... | ... | ... | ... | ... |
++---------+------+------+------+-------------------------------+
+| | | | | |
+| 3072 | O | O | O | Live Migration Support |
++---------+------+------+------+-------------------------------+
+| | | | | |
+|4095:3073| O | O | O | Vendor Specific |
++---------+------+------+------+-------------------------------+
+
+According to NVMe specification, the bytes from 3072 to 4095 are vendor-specific fields.
+NVMe device uses the 3072 bytes in the identify controller data structure to indicate
+whether live migration is supported. 0x0 means live migration is not supported. 0x01 means
+live migration is supported, and other values are reserved.
+
+[1] https://nvmexpress.org/wp-content/uploads/NVMe-NVM-Express-2.0a-2021.07.26-Ratified.pdf
--
2.34.1