Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

From: Joao Martins
Date: Fri Feb 04 2022 - 14:53:48 EST


On 2/3/22 15:18, Jason Gunthorpe wrote:
> On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
>> On 2/2/22 17:03, Jason Gunthorpe wrote:
>>> how to integrate that with the iommufd work, which I hope will allow
>>> that series, and the other IOMMU drivers that can support this to be
>>> merged..
>>
>> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
>> there, but TBH I am not up to speed on iommu-fd yet so I missed something
>> obvious for sure. When you say 'integrate that with the iommufd' can you
>> expand on that?
>
> The general idea is that iommufd is the place to put all the iommu
> driver uAPI for consumption by userspace. The IOMMU feature of dirty
> tracking would belong there.
>
> So, some kind of API needs to be designed to meet the needs of the
> IOMMU drivers.
>
/me nods

I am going to assume the tree below is the most up-to-date iommufd (as you pointed
out in another thread, IIRC):

https://github.com/jgunthorpe/linux iommufd

Let me know if it's not :)

>> Did you meant to use interface in the link, or perhaps VFIO would use an iommufd
>> /internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's
>> not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the
>> VMM to scan iommu pagetables itself and see what was marked dirty or
>> not?
>
> iommufd and VFIO container's don't co-exist, either iommufd is
> providing the IOMMU interface, or the current type 1 code - not both
> together.
>
> iommfd current approach presents the same ABI as the type1 container
> as compatability, and it is a possible direction to provide the
> iommu_domain stored dirty bits through that compat API.
>
> But, as you say, it looks unnatural and inefficient when the domain
> itself is storing the dirty bits inside the IOPTE.
>
How much of this is already represented by io-pgtable in the IOMMU internal kAPI
(if we exclude the UAPI portion of iommufd for now)? FWIW, that is today
used by the AMD and ARM IOMMUs. Albeit, not Intel :(

> It need some study I haven't got into yet :)
>
Heh :)

Depending on how easy it is to obtain the full extent of the IO pagetables via
iommufd, and whether userspace code can scan the dirty bits on its own ... then
potentially the VMM/process can scan the dirtied set more efficiently? But if some
layer needs to mediate between the vendor IOPTE representation and a UAPI IOPTE
representation, to be able to make that delegation to userspace ... then maybe both
end up inefficient? I didn't see how iommufd would abstract the IOPTE lookup in the
brief look I had at the code; perhaps that's another ioctl().

>> My over-simplistic/naive view was that the proposal in the link
>> above sounded a lot simpler. While iommu-fd had more longevity for
>> many other usecases outside dirty tracking, no?
>
> I'd prefer we don't continue to hack on the type1 code if iommufd is
> expected to take over in this role - especially for a multi-vendor
> feature like dirty tracking.
>
But what strikes me /specifically/ about the dirty-bit feature is that it looks
simpler with the current VFIO; the heavy lifting seems to fall
mostly on the IOMMU vendor driver. The API proposed above touches VFIO only at
the container (small changes), and the IOMMU vendor driver would do most of it
(a rough sketch of these ops follows below):

* Toggling API for start/stop dirty tracking
* API to get the IOVAs dirtied
* API to clear the IOVAs dirtied
[I am not sure this last one is needed, as the clear could be done as we scan
the dirty entries, thus minimizing the IOTLB flush cost if these were separate]
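
To make that shape concrete, here is a minimal sketch of how those three
operations could hang off a per-domain ops structure. All names and
signatures are hypothetical placeholders of mine, not from any posted
series, and the clear op may well be folded into the read op as noted above:

/* Hypothetical sketch only -- illustrative names, not a real header. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct iommu_domain;	/* opaque here */

struct iommu_dirty_ops {
	/* toggle: start/stop dirty tracking for the whole domain */
	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enable);

	/* report which IOVAs were dirtied in [iova, iova + size) */
	int (*read_dirty)(struct iommu_domain *domain, uint64_t iova,
			  size_t size, unsigned long *bitmap);

	/* clear the dirty state for a range (possibly folded into read) */
	int (*clear_dirty)(struct iommu_domain *domain, uint64_t iova,
			   size_t size);
};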

At the same time, what particularly scares me perf-wise (for the device being migrated)
is the fact that we need to dynamically split and collapse page tables to
increase the granularity at which we track. In the above interface the
split/collapse happens when you turn dirty tracking on/off (respectively).
That's *probably* where we need more flexibility, not sure.

> It is actually a pretty complicated topic because migration capable
> PCI devices are also include their own dirty tracking HW, all this
> needs to be harmonized somehow.

Do you have thoughts on what such a device-dirty interface could look like?
(Perhaps too early to poke while the FSM/UAPI is being worked out.)

I was wondering whether the container could have a dirty scan/sync callback
funnelled through vendor IOMMU ops (as Shameerali's patches proposed), with the vfio
vendor driver providing one per device, or whether to propagate the dirty tracking
API to the vendor vfio driver[*]. The reporting of the dirtying, though, looks hazy
to achieve if you try to make it uniform all the way up to userspace. Perhaps with
iommufd you're thinking to mmap() the dirty region back to userspace, or have an
iommufd ioctl() update the PTEs while letting the kernel clear the dirty status via
the mmap() object, and that would be the common API regardless of the dirty-HW
scheme. Anyway, just thinking out loud; a strawman of the per-device hook follows below.

[*] considering the device may choose where to place its tracking storage, and
which scheme (bitmap, ring, etc.) it might use.
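
As a strawman for that per-device side, the hook could mirror the IOMMU one
so the container/core can funnel both sources into a single report. Again,
everything below is an illustrative sketch of mine, not an existing interface:

/* Hypothetical sketch only -- illustrative names, not a real header. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vfio_device;	/* opaque here */

struct vfio_device_dirty_ops {
	int (*set_dirty_tracking)(struct vfio_device *vdev, bool enable);
	/*
	 * The device may keep its state in whatever form it likes
	 * (bitmap, ring, ...); the vendor driver converts it into the
	 * common bitmap when reporting.
	 */
	int (*report_dirty)(struct vfio_device *vdev, uint64_t iova,
			    size_t size, unsigned long *bitmap);
};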

> VFIO proposed to squash everything
> into the container code, but I've been mulling about having iommufd
> only do system iommu and push the PCI device internal tracking over to
> VFIO.
>

Seems to me that the juicy part falls mostly in the IOMMU vendor code; I am
not sure yet how much one can 'offload' to a generic layer, at least
compared with this other proposal.

>> I have a PoC-ish using the interface in the link, with AMD IOMMU
>> dirty bit supported (including Qemu emulated amd-iommu for folks
>> lacking the hardware). Albeit the eager-spliting + collapsing of
>> IOMMU hugepages is not yet done there, and I wanted to play around
>> the emulated intel-iommu SLADS from specs looks quite similar. Happy
>> to join existing effort anyways.
>
> This sounds great, I would love to bring the AMD IOMMU along with a
> dirty tracking implementation! Can you share some patches so we can
> see what the HW implementation looks like?

Oh yes, for sure! As I said, I'm happy to help and implement along.
I would really love to leverage the IOMMU feature, as it relieves
migratable PCI devices from having to do the dirty tracking themselves.

And well, we seem to be getting there -- spec-wise everybody has that
feature *at least* documented :)

Give me some time (a few days only, as I have to sort some things out) and I'll
follow up here with a link to a branch containing the WIP/PoC patches.

Summing up a couple of remarks on the hardware, hopefully they shed some light:

1) On AMD the feature is advertised as supported (or not) by the extended
feature register. On Intel, the same in its equivalent (ECAP); of course
with different bits in the different IOMMUs' registers. If it is supported,
the IOMMU (when activated) updates two bits per page table entry to indicate
'accessed' and 'dirty'. The page table formats differ slightly bit-wise on
where accessed/dirty are located (between the two vendors); see the small
sketch below.
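
To illustrate what "two bits per PTE" means for the code that scans them,
a tiny sketch. The bit positions are placeholders of mine; the real offsets
differ between the AMD and Intel page-table formats:

#include <stdbool.h>
#include <stdint.h>

#define IOPTE_ACCESSED	(1ULL << 5)	/* placeholder position */
#define IOPTE_DIRTY	(1ULL << 6)	/* placeholder position */

static inline bool iopte_accessed(uint64_t pte)
{
	return pte & IOPTE_ACCESSED;
}

static inline bool iopte_dirty(uint64_t pte)
{
	return pte & IOPTE_DIRTY;
}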

2) Change protection domain flags to enable the dirty/access tracking.
On AMD, it's the DTE flags (a new flag for dirty tracking).
[Intel does this in the PASID table entry]

3) The dirty bit is sticky; the hardware never clears it. Reading the
accessed/dirty bits is cheap, but clearing them is 'expensive' because one
needs to flush the IOTLB, as the IOMMU hardware may cache the bits in the
IOTLB as a result of an address translation/IO page walk (even though the
IOMMU uses interlocked operations to actually update the accessed/dirty bits
concurrently with the CPU). The AMD manuals are a tad misleading as they talk
about marking entries non-present, but that would be catastrophic for
migration as it would mean a DMA target abort for the PCI device, unless I
missed something obvious. In any case, this means that the dirty bit
*clearing* needs to be batched as much as possible, to amortize the cost of
flushing the IOTLB. This is the same for Intel, *IIUC*. A toy illustration of
the batched clear follows below.
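
To show the structure of that batched clear (scan, record, clear as we go,
then a single IOTLB flush for the whole range), here is a self-contained toy
with a flat array standing in for the IO page table; the bit position and
helpers are placeholders of mine, not real driver code:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define IOPTE_DIRTY	(1ULL << 6)	/* placeholder bit position */
#define NR_PTES		512

static uint64_t iopt[NR_PTES];		/* toy flat "IO page table" */
static unsigned int iotlb_flushes;

static void domain_flush_iotlb(void)	/* stand-in for the real flush */
{
	iotlb_flushes++;
}

/* scan [first, first + npages), record dirty pages, clear as we scan */
static void scan_and_clear_dirty(size_t first, size_t npages,
				 unsigned long *bitmap)
{
	bool cleared = false;

	for (size_t i = 0; i < npages; i++) {
		size_t idx = first + i;

		if (iopt[idx] & IOPTE_DIRTY) {
			iopt[idx] &= ~IOPTE_DIRTY; /* real code: atomic */
			bitmap[idx / (8 * sizeof(long))] |=
				1UL << (idx % (8 * sizeof(long)));
			cleared = true;
		}
	}

	/* one IOTLB flush for the whole range, not one per page */
	if (cleared)
		domain_flush_iotlb();
}

int main(void)
{
	unsigned long bitmap[NR_PTES / (8 * sizeof(long))] = { 0 };

	iopt[3] |= IOPTE_DIRTY;
	iopt[42] |= IOPTE_DIRTY;
	scan_and_clear_dirty(0, NR_PTES, bitmap);
	printf("IOTLB flushes for the whole range: %u\n", iotlb_flushes);
	return 0;
}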

4) Adjust the granularity of pagetables in place:
[This item wasn't done, but it is generic to any IOMMU because it
is mostly the ability to split existing IO pages in place.]

4.a) Eager splitting and late collapsing of IO hugepages -- essentially
to track at a finer granularity. As you know, the IOMMU map is generally
done with the biggest IO page size, especially for guests. We sadly can't
write-protect or do anything fancier like we usually do, otherwise the PCI
device gets a DMA target abort :(
So we split when we start dirty tracking (that's what I mean by eager), and
do it for the whole IO page table. When tracking finishes, we collapse the
pages back (which should probably be deemed optional in the common case of
the migration succeeding). The guest is expected to see a higher IOTLB miss
rate. This part is particularly worrying for guest IO performance, but
hopefully it doesn't turn out to be too bad. A toy sketch of the split step
follows below.
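
To make 4.a a bit more tangible, here is a toy of the split step itself: one
huge leaf gets replaced by a table of base-page leaves covering the same
IOVA->phys range, so dirty bits are then recorded at base-page granularity.
The sizes, flag handling and the missing "repoint the parent entry and
flush" step are all illustrative assumptions of mine:

#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define HUGE_SIZE	(2UL << 20)		/* 2M example */
#define PTES_PER_HUGE	(HUGE_SIZE / PAGE_SIZE)

/*
 * Fill a new leaf table with base-page entries that cover the same
 * physical range the huge mapping did, preserving its permission flags.
 */
static void split_huge_mapping(uint64_t phys_base, uint64_t flags,
			       uint64_t *new_leaf_table)
{
	for (size_t i = 0; i < PTES_PER_HUGE; i++)
		new_leaf_table[i] = (phys_base + i * PAGE_SIZE) | flags;

	/*
	 * The real driver would then atomically repoint the parent entry
	 * at new_leaf_table and flush the IOTLB -- omitted here.
	 */
}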

4.b) Optionally start dirty tracking earlier (at provisioning) and let
userspace dynamically split pages. This is to hopefully minimize the
IOTLB misses we induce on ourselves in item 4.a) if we were to split eagerly.
So dirty tracking would be enabled at creation of the protection domain,
after the vfio container is set up, and we would use the pages dirtied
as an indication of what needs to be split. The problem is IO page
sizes bigger than 1G, which might unnecessarily lead to marking too
much as dirty early on; but at least it's better than transferring the
whole set.

Joao