To reiterate what I mentioned before about IOMMU DMA unmap on x86, a key
difference is that by default it uses the non-strict (lazy) unmap mode, i.e.
we unmap in batches. ARM64 uses the general default, which is strict mode,
i.e. every unmap results in an IOTLB flush.
In my setup, if I switch to lazy unmap (set iommu.strict=0 on cmdline), then
no lockup.
Are any special IOMMU setups being used for x86, like enabling strict mode?
I don't know...
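For reference, one way to check which unmap mode a setup is actually running
in might look like the following; the sysfs attribute is only present on
newer kernels, so treat this as a sketch rather than a guaranteed interface:

    # any iommu.* overrides on the kernel command line?
    tr ' ' '\n' < /proc/cmdline | grep '^iommu'

    # on newer kernels the per-group default domain type is visible here;
    # "DMA" means strict unmap, "DMA-FQ" means lazy (flush queue) unmap
    cat /sys/kernel/iommu_groups/*/type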
Either way, this sounds like a problem in the DMA API more than an
architecture-specific problem.

Given that we have so far very little data, I'd hold off any conclusion.
We can start to collect latency data of dma unmapping vs nvme_irq()
on both x86 and arm64.
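One way to get a first look at those numbers might be the function_graph
tracer, which reports a duration for every traced call. A rough sketch; the
DMA unmap symbol name is a guess and depends on the kernel version and on
whether the function ends up inlined:

    cd /sys/kernel/debug/tracing
    echo nvme_irq > set_graph_function
    echo iommu_dma_unmap_sg >> set_graph_function   # symbol name may differ
    echo function_graph > current_tracer
    cat trace_pipe | head -50
    echo nop > current_tracer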
I will see if I can get such a box for collecting the latency data.

BTW, I have run the test on one 224-core ARM64 machine with one 32-hw_queue
NVMe; the soft lockup issue can be triggered in one minute.

nvme_irq() often takes ~5us to complete on this machine, so there is a real
risk of CPU lockup once IOPS goes above 200K.
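Spelling out the arithmetic (assuming roughly one interrupt per completed IO,
i.e. no interrupt coalescing):

    200,000 interrupts/s * 5 us/interrupt = 1,000,000 us/s = 1 CPU-second/s

so at that rate the hard interrupt handler alone can consume the whole of the
CPU the interrupt is routed to, which is exactly the soft lockup condition.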
The soft lockup can be triggered too if 'iommu.strict=0' is passed in; it
just takes a bit longer, by starting more IO jobs.
In the above test, I submit IO to one single NVMe drive from 4 CPU cores via
8 or 12 jobs (12 in the iommu.strict=0 case), and meanwhile make the NVMe
interrupt be handled on just one dedicated CPU core.
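For anyone trying to reproduce this, a job along these lines should be close
to the setup described above; device name, CPU list and job count are
placeholders, and the completion CPU is decided by the driver's managed IRQ
affinity rather than by anything in the fio job, so check /proc/interrupts to
see which core actually takes the interrupt:

    fio --name=repro --filename=/dev/nvme0n1 --direct=1 \
        --ioengine=libaio --rw=randread --bs=4k --iodepth=64 \
        --numjobs=12 --cpus_allowed=0-3 --cpus_allowed_policy=shared \
        --runtime=60 --time_based --group_reporting

    # watch which CPU is servicing the nvme completion interrupts
    watch -n1 'grep nvme /proc/interrupts'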
Is there lock contention between the IOMMU DMA map and unmap callbacks?
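If the kernel is built with CONFIG_LOCK_STAT, that should be answerable
directly. A rough way to look; the 'iova' pattern is just a guess at the
relevant lock names, the IOVA allocator locks being the likely suspects:

    echo 0 > /proc/lock_stat               # clear old statistics
    echo 1 > /proc/sys/kernel/lock_stat    # start collecting
    # ... run the IO test ...
    echo 0 > /proc/sys/kernel/lock_stat    # stop collecting
    grep -i -A 6 iova /proc/lock_stat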