To reiterate what I mentioned before about IOMMU DMA unmap on x86, a key
difference is that by default it uses the non-strict (lazy) unmap mode, i.e.
we unmap in batches. ARM64 uses the general default, which is strict mode,
i.e. every unmap results in an IOTLB flush.
In my setup, if I switch to lazy unmap (set iommu.strict=0 on cmdline), then
no lockup.
Are any special IOMMU setups being used for x86, like enabling strict mode?
I don't know...
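For reference, one way to check which unmap mode a setup is actually running
in might look like the following; the sysfs attribute is only present on
newer kernels, so treat this as a sketch rather than a guaranteed interface:

    # any iommu.* overrides on the kernel command line?
    tr ' ' '\n' < /proc/cmdline | grep '^iommu'

    # on newer kernels the per-group default domain type is visible here;
    # "DMA" means strict unmap, "DMA-FQ" means lazy (flush queue) unmap
    cat /sys/kernel/iommu_groups/*/type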
Either way, this sounds like a problem in the DMA API more than an
architecture-specific problem.

Given that we have so far very little data, I'd hold off any conclusion.
We can start to collect latency data of dma unmapping vs nvme_irq()
on both x86 and arm64.
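One way to get a first look at those numbers might be the function_graph
tracer, which reports a duration for every traced call. A rough sketch; the
DMA unmap symbol name is a guess and depends on the kernel version and on
whether the function ends up inlined:

    cd /sys/kernel/debug/tracing
    echo nvme_irq > set_graph_function
    echo iommu_dma_unmap_sg >> set_graph_function   # symbol name may differ
    echo function_graph > current_tracer
    cat trace_pipe | head -50
    echo nop > current_tracer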
I will see if I can get such a box for collecting the latency data.

BTW, I have run the test on one 224-core ARM64 machine with one 32-hw_queue
NVMe; the soft lockup issue can be triggered in one minute.

nvme_irq() often takes ~5us to complete on this machine, so there is a real
risk of CPU lockup once IOPS goes above 200K.
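Spelling out the arithmetic (assuming roughly one interrupt per completed IO,
i.e. no interrupt coalescing):

    200,000 interrupts/s * 5 us/interrupt = 1,000,000 us/s = 1 CPU-second/s

so at that rate the hard interrupt handler alone can consume the whole of the
CPU the interrupt is routed to, which is exactly the soft lockup condition.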
The soft lockup can be triggered too if 'iommu.strict=0' is passed in; it
just takes a bit longer, by starting more IO jobs.
In the above test, I submit IO to one single NVMe drive from 4 CPU cores via
8 or 12 jobs (12 in the iommu.strict=0 case), and meanwhile make the NVMe
interrupt be handled on just one dedicated CPU core.
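For anyone trying to reproduce this, a job along these lines should be close
to the setup described above; device name, CPU list and job count are
placeholders, and the completion CPU is decided by the driver's managed IRQ
affinity rather than by anything in the fio job, so check /proc/interrupts to
see which core actually takes the interrupt:

    fio --name=repro --filename=/dev/nvme0n1 --direct=1 \
        --ioengine=libaio --rw=randread --bs=4k --iodepth=64 \
        --numjobs=12 --cpus_allowed=0-3 --cpus_allowed_policy=shared \
        --runtime=60 --time_based --group_reporting

    # watch which CPU is servicing the nvme completion interrupts
    watch -n1 'grep nvme /proc/interrupts'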
Is there lock contention between the IOMMU DMA map and unmap callbacks?
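If the kernel is built with CONFIG_LOCK_STAT, that should be answerable
directly. A rough way to look; the 'iova' pattern is just a guess at the
relevant lock names, the IOVA allocator locks being the likely suspects:

    echo 0 > /proc/lock_stat               # clear old statistics
    echo 1 > /proc/sys/kernel/lock_stat    # start collecting
    # ... run the IO test ...
    echo 0 > /proc/sys/kernel/lock_stat    # stop collecting
    grep -i -A 6 iova /proc/lock_stat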