Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

From: Joao Martins
Date: Fri Feb 11 2022 - 16:44:28 EST


On 2/11/22 17:49, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 05:28:22PM +0000, Joao Martins wrote:
>>> It is basically the same, almost certainly the user API in iommufd
>>> will be some 'get dirty bits' and 'unmap and give me the dirty bits'
>>> just like vfio has.
>>>
>>
>> The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
>> migration flow.
>
> It is essential to implement any kind of viommu behavior where
> map/unmap is occurring while the dirty tracking is running. It should
> never make a difference except in some ugly edge cases where the dma
> and unmap are racing.
>
/me nods

>> supposed to be happening (excluding P2P)? So perhaps the unmap is
>> unneeded after quiescing the VF.
>
> Yes, you don't need to unmap for migration, a simple fully synchronous
> read and clear is sufficient. But that final read, while DMA is quiet,
> must obtain the latest dirty bit data.
>

The final read doesn't need special logic in VFIO/IOMMU IIUC
(maybe only in the VMM).

Anyways, the above paragraph matches my understanding.

>> We have a bitmap based interface in KVM, but there's also a recent ring
>> interface for dirty tracking, which is probably more deterministic than
>> a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
>> and break its entries on-demand IIRC, whereas Intel resembles something
>> closer to a 512 entries 'ring' with VMX PML, which tells what has been
>> dirtied.
>
> KVM has an advantage that it can throttle the rate of dirty generation
> so it can rely on logging. The IOMMU can't throttle DMA, so we are
> stuck with a bitmap
>
Yeap, sadly :(

>>> I don't know if mmap should be involved here, the dirty bitmaps are not
>>> so big, I suspect a simple get_user_pages_fast() would be entirely OK.
>>>
>> Considering that is 32MB of a bitmap per TB maybe it is cheap.
>
> Right. GUP fasting a couple huge pages is nothing compared to scanning
> 1TB of IO page table.
>
... With 4K PTEs, which is even more ludicrously expensive. Well yeah,
mangling the bitmap is a lesser concern when you put it that way, heh.
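
As a rough sanity check of the numbers above, here is a small sketch (plain arithmetic, not kernel code). It assumes one dirty bit per 4K page and binary units (TiB/MiB); the helper name `dirty_bitmap_bytes` is made up for illustration.

```python
KiB, MiB, GiB, TiB = 1 << 10, 1 << 20, 1 << 30, 1 << 40


def dirty_bitmap_bytes(mem_bytes: int, page_size: int) -> int:
    """Bytes needed for a bitmap holding one dirty bit per page (hypothetical helper)."""
    pages = -(-mem_bytes // page_size)  # ceiling division
    return -(-pages // 8)               # 8 bits per byte, rounded up


bitmap = dirty_bitmap_bytes(1 * TiB, 4 * KiB)
print(bitmap // MiB, "MiB of bitmap per TiB at 4K granularity")
print(bitmap // (2 * MiB), "2M huge pages needed to hold that bitmap")
print((1 * TiB) // (4 * KiB), "4K IOPTEs to scan for the same 1 TiB")
```

Under those assumptions the bitmap fits in a handful of 2M pages, while the page-table scan has to touch hundreds of millions of entries, which is the asymmetry being pointed at.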

>>> You have to mark it as non-present to do the final read out if
>>> something unmaps while the tracker is on - eg emulating a viommu or
>>> something. Then you mark non-present, flush the iotlb and read back
>>> the dirty bit.
>>>
>> You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
>> too :) without needing VMM intervention (that's also not supported
>> in Linux).
>
> I'm sure, but dirty tracking has to happen on the kernel owned page
> table, not the user owned one I think..
>
The plumbing for the hw-accelerated vIOMMU is a little different than
a regular vIOMMU; at least IIUC the host does not take an active part in the
GVA -> GPA translation. Suravee's preso explains it nicely, if you don't
have time to fiddle with the SDM:

https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf


>>> Otherwise AFAIK, you flush the IOTLB to get the latest dirty bits and
>>> then read and clear them.
>>>
>> It's the other way around AIUI. The dirty bits are sticky, so you flush
>> the IOTLB after clearing as a means to notify the IOMMU to set the dirty bits
>> again on the next memory transaction (or ATS translation).
>
> I guess it depends on how the HW works, if it writes the dirty bit
> back to ram instantly on the first dirty, or if it only writes the
> dirty bit when flushing the iotlb.
>
The manual says roughly that the update is visible to the CPU as soon as the
IOMMU updates it. Particularly, from the IOMMU SDM:

"Note that the setting of accessed and dirty status bits in the page tables is visible to
both the CPU and the peripheral when sharing guest page tables. The IOMMU interlocked
operations to update A and D bits must be 64-bit operations and naturally aligned on a
64-bit boundary"

And ...

1. Decodes the read and write intent from the memory access.
2. If P=0 in the page descriptor, fail the access.
3. Compare the A & D bits in the descriptor with the read and write intent in the request.
4. If the A or D bits need to be updated in the descriptor:
   * Start atomic operation.
   * Read the descriptor as a 64-bit access.
   * If the descriptor no longer appears to require an update, release the atomic lock with
     no further action and continue to step 5.
   * Calculate the new A & D bits.
   * Write the descriptor as a 64-bit access.
   * End atomic operation.
5. Continue to the next stage of translation or to the memory access.
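
The steps above can be sketched as a toy software model. This is only an illustration of the control flow: the bit positions, the `Descriptor` class and the `translate` function are assumptions made up for the example, and a `threading.Lock` stands in for the hardware's interlocked 64-bit operation.

```python
import threading

# Illustrative bit positions only; assumptions, not taken from the spec.
P_BIT = 1 << 0
A_BIT = 1 << 5
D_BIT = 1 << 6
MASK64 = (1 << 64) - 1


class Descriptor:
    """Toy 64-bit page descriptor (hypothetical, for illustration)."""

    def __init__(self, value: int):
        self.value = value & MASK64
        self.lock = threading.Lock()  # stands in for the atomic operation
        self.atomic_updates = 0       # how many times the descriptor was rewritten


def translate(desc: Descriptor, is_write: bool) -> bool:
    """Return True if the access may proceed, False if it must fail."""
    # Step 1: decode intent. Any access wants A; a write also wants D.
    wanted = A_BIT | (D_BIT if is_write else 0)
    # Step 2: non-present descriptor fails the access.
    if not desc.value & P_BIT:
        return False
    # Step 3: compare the A & D bits with the intent.
    if desc.value & wanted != wanted:
        # Step 4: update under the atomic operation.
        with desc.lock:
            current = desc.value              # read as one 64-bit value
            if current & wanted != wanted:    # still needs an update?
                desc.value = (current | wanted) & MASK64
                desc.atomic_updates += 1
    # Step 5: continue to the next stage or the memory access.
    return True


pte = Descriptor(P_BIT)
translate(pte, is_write=False)  # sets A
translate(pte, is_write=True)   # sets D
translate(pte, is_write=True)   # bits already set: only the cheap compare runs
print(hex(pte.value), pte.atomic_updates)
```

In this model the third access never enters the locked section, which is the "sticky bits" behaviour: once A and D are set, further accesses cost only the comparison in step 3.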

> In any case you have to synchronize with the HW in some way to ensure
> that all dirty bits are 'pulled back' into system memory to resolve
> races (ie you need the iommu HW to release and the CPU to acquire on
> the dirty data) before trying to read them, at least for the final
> quiesced system read.
>
/me nods
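
To make the clear-then-flush ordering discussed above concrete, here is a toy model. It assumes the "sticky" behaviour described earlier in this mail, where a cached IOTLB entry that already saw a write does not touch the in-memory D bit again. Real hardware may behave differently, as Jason points out, and `ToyIommu` and its methods are invented for the example.

```python
class ToyIommu:
    """Toy model of sticky dirty bits plus an IOTLB (hypothetical)."""

    def __init__(self, npages: int):
        self.dirty = [False] * npages  # D bit of each in-memory PTE
        self.iotlb = {}                # page -> True once a write was cached

    def dma_write(self, page: int) -> None:
        if self.iotlb.get(page):       # cached as already dirty: no PTE update
            return
        self.dirty[page] = True        # page walk sets D in memory
        self.iotlb[page] = True

    def flush_iotlb(self) -> None:
        self.iotlb.clear()

    def read_and_clear(self, flush: bool = True) -> list:
        snapshot = [i for i, d in enumerate(self.dirty) if d]
        self.dirty = [False] * len(self.dirty)
        if flush:
            self.flush_iotlb()         # forces the next write to set D again
        return snapshot


iommu = ToyIommu(4)
iommu.dma_write(1)
first = iommu.read_and_clear(flush=False)  # clears D but leaves the IOTLB entry
iommu.dma_write(1)                         # hits the stale entry, D stays clear
missed = iommu.read_and_clear(flush=True)
iommu.dma_write(1)
caught = iommu.read_and_clear()
print(first, missed, caught)
```

In this model, skipping the flush after clearing loses the second write to page 1. It does not model a DMA racing with the clear itself, which is the synchronization problem raised in the quoted text.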

>>> This seems like it would be some interesting amount of driver work,
>>> but yes it could be a generic new iommu_domain op.
>>
>> I am slightly at odds that .split and .collapse at .switch() are enough.
>> But, with iommu if we are working on top of an IOMMU domain object and
>> .split and .collapse are iommu_ops perhaps that looks to be enough
>> flexibility to give userspace the ability to decide what it wants to
>> split, if it starts eagerly/warming-up tracking dirty pages.
>>
>> The split and collapsing is something I wanted to work on next, to get
>> to a stage closer to that of an RFC on the AMD side.
>
> split/collapse seems kind of orthogonal to me; it doesn't really
> connect to dirty tracking other than being mostly useful during dirty
> tracking.
>
> And I wonder how hard split is when trying to atomically preserve any
> dirty bit..
>
Why would it be hard? The D bit is supposed to be replicated when you
split to smaller page size.
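
A minimal sketch of what "replicating the D bit on split" means, as a toy data structure. The `Pte` class and `split` helper are hypothetical, and the sketch deliberately ignores the atomicity question Jason raises (a DMA dirtying the large entry while it is being replaced).

```python
class Pte:
    """Toy IO page table entry covering [iova, iova + size) (hypothetical)."""

    def __init__(self, iova: int, size: int, dirty: bool):
        self.iova = iova
        self.size = size
        self.dirty = dirty


def split(pte: Pte, new_size: int) -> list:
    """Split one entry into smaller ones, each inheriting the parent's D bit."""
    if pte.size % new_size != 0:
        raise ValueError("new_size must evenly divide the entry size")
    return [Pte(pte.iova + off, new_size, pte.dirty)
            for off in range(0, pte.size, new_size)]


huge = Pte(iova=0, size=2 << 20, dirty=True)   # one dirty 2M entry
small = split(huge, 4096)                       # 512 entries of 4K
print(len(small), all(p.dirty for p in small))
```

Inheriting the bit is conservative: every 4K child is reported dirty once, and only later writes refine which children are actually being touched.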

>> Hmmm, judging how the IOMMU works I am not sure this is particularly
>> affecting DMA performance (not sure yet about RDMA, it's something I'm
>> curious to see how it gets to perform with 4K IOPTEs, and with dirty
>> tracking always enabled). Considering how the bits are sticky, and
>> unless CPU clears it, it's short of a nop? Unless of course the checking
>> for A^D during an atomic memory transaction is expensive. Needs some
>> performance testing nonetheless.
>
> If you leave the bits all dirty then why do it? The point is to see
> the dirties, which means the iommu is generating a workload of dirty
> cachelines while operating it didn't do before.
>
My thinking was that if it's dirtied and in the IOTLB, most likely the
descriptor is cached in the IOTLB. And if you need to do an IOMMU page walk
to resolve an IOVA, checking whether the A & D bits need to be updated
is probably the least of the problems in this path. Naturally, if
it's not split, you have a much higher chance (e.g. with 1GB entries) to stay
in the IOTLB and just compare two bits *prior* to considering starting
the atomic operation to update the descriptor.

>> I forgot to mention, but the early enablement of IOMMU dirty tracking
>> was also meant to fully know since guest creation what needs to be
>> sent to the destination. Otherwise, wouldn't we need to send the whole
>> pinned set to destination, if we only start tracking dirty pages during
>> migration?
>
> ? At the start of migration you have to send everything. Dirty
> tracking is intended to allow the VM to be stopped and then have only
> a small set of data to xfer.
>
Right, that's how it works today.

This is just preemptive long-term thinking about the overall problem space (probably
unnecessary noise at this stage), particularly for whenever I need to migrate 1 to 2TB VMs,
and particularly because the stage *prior* to precopy takes way too long to transfer the whole
memory. So I was thinking of, say, only transferring the pages that are populated[*] in the
second-stage page tables (for the CPU), coupled with IOMMU tracking from the beginning
(prior to vcpus even entering). That could probably reduce 1024 dirtied 1GB IOVA
entries to maybe only a smaller dirty subset, saving a whole boatload of time.

[*] VMs without VFIO would be even easier as the first stage page tables are non-present.

>> Also, this is probably a differentiator for iommufd, if we were to provide
>> split and collapse semantics to IOMMU domain objects that userspace can use.
>> That would get more freedom, to switch dirty-tracking, and then do the warm
>> up thingie and piggy back on what it wants to split before migration.
>> Perhaps the switch() should get some flag to pick where to split, I guess.
>
> Yes, right. Split/collapse should be completely separate from dirty
> tracking.

Yeap.

I wonder if we could start progressing the dirty tracking as a first series and
then have the split + collapse handling as a second part? That would be quite
nice to get me going! :D

Joao