答复: dma_map_resource() has a bad performance in pcie peer to peer transactions when iommu enabled in Linux

From: Kelly Devilliv
Date: Tue Sep 26 2023 - 00:33:22 EST


On 2023-09-26 01:58, Christian König wrote:
> Am 25.09.23 um 16:17 schrieb Kelly Devilliv:
>> On 2023-09-25 19:16, Robin Murphy wrote:
>>> On 2023-09-25 04:59, Kelly Devilliv wrote:
>>>> Dear all,
>>>>
>>>> I am working on an ARM-V8 server with two gpu cards on it. Recently,
>>>> I need
>>>> to test pcie peer to peer communication between the two gpu cards,
>>>> but the throughput is only 4GB/s.
>>>> After I explored the gpu's kernel mode driver, I found it was using
>>>> the dma_map_resource() API to map the peer device's MMIO space. The arm
>>>> iommu driver then will hardcode a 'IOMMU_MMIO' prot in the later dma map:
>>>> static dma_addr_t iommu_dma_map_resource(struct device
>>>> *dev,
>>>> phys_addr_t phys,
>>>> size_t size, enum
>>>> dma_data_direction
>>>> dir, unsigned long attrs)
>>>> {
>>>> return __iommu_dma_map(dev, phys, size,
>>>> dma_info_to_prot(dir,
>>>> false,
>>>> attrs) | IOMMU_MMIO,
>>>> dma_get_mask(dev));
>>>> }
>>>>
>>>> And that will finally set the 'ARM_LPAE_PTE_MEMATTR_DEV' attribute
>>>> in PTE,
>>>> which may have a negative impact on the performance of the pcie peer
>>>> to peer transactions.
>>>> /*
>>>> * Note that this logic is structured to accommodate Mali LPAE
>>>> * having stage-1-like attributes but stage-2-like permissions.
>>>> */
>>>> if (data->iop.fmt == ARM_64_LPAE_S2 ||
>>>> data->iop.fmt == ARM_32_LPAE_S2) {
>>>> if (prot & IOMMU_MMIO)
>>>> pte |= ARM_LPAE_PTE_MEMATTR_DEV;
>>>> else if (prot & IOMMU_CACHE)
>>>> pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
>>>> else
>>>> pte |= ARM_LPAE_PTE_MEMATTR_NC;
>>>> } else {
>>>> if (prot & IOMMU_MMIO)
>>>> pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
>>>> <<
>ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>>> else if (prot & IOMMU_CACHE)
>>>> pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
>>>> <<
>ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>>> }
>>>>
>>>> I tried to remove the 'IOMMU_MMIO' prot in the dma_map_resource()
>>>> API
>>>> and re-compile the linux kernel, the throughput then can be up to 28GB/s.
>>>> Is there an elegant way to solve this issue without modifying the linux kernel?
>>>> e.g., a substitution of dma_map_resource() API?
>>>
>>> Not really. Other use-cases for dma_map_resource() include DMA
>>> offload engines accessing FIFO registers, where allowing reordering,
>>> write-gathering, etc. would be a terrible idea. Thus it needs to
>>> assume a "safe" MMIO memory type, which on Arm means Device-nGnRE.
>>>
>>> However, the "proper" PCI peer-to-peer support under
>>> CONFIG_PCI_P2PDMA ended up moving away from the
>dma_map_resource()
>>> approach anyway, and allows this kind of device memory to be treated
>>> more like regular memory (via
>>> ZONE_DEVICE) rather than arbitrary MMIO resources, so your best bet
>>> would be to get the GPU driver converted over to using that.
>>
>> Thanks Robin.
>> So your suggestion is we'd better work out a new implementation just
>> as what it does under CONFIG_PCI_P2PDMA instead of just using the
>> dma_map_resource() API?
>>
>> I have explored the GPU drivers from AMD, Nvidia and habanalabs, e.g.,
>> and found they all using the dma_map_resource() API to map the peer
>> device's bar address.
>> If so, is it possible to be a common performance issue in PCI peer-to-peer
>> scenario?
>
> That's not an issue, but expected behavior.
>
> When you enable IOMMU every transaction needs to go through the root
> complex for address translation and you completely lose the performance
> benefit of PCIe P2P.

Thanks Christian. That's true.

>
> This is a hardware limitation and not really related to
> dma_map_resource() in any way.
>

But when I removed the 'IOMMU_MMIO' prot in dma_map_resource(), the performace was significantly improved (from 4GB/s to 28GB/s), which was almost the same as what it can be when IOMMU disabled. So I guess in my common pci topology, what really matters may not be whether IOMMU is enabled or not, but in fact the attributes in dma mapping or ARM PTE does.

I don't know if there is a way to make the memory attributes more configurable in order to be distinguished from the "safe" MMIO memory type, which on Arm means Device-nGnRE as Robin said.

Sincerely,
Kelly

> Regards,
> Christian.
>
>>
>>> Thanks,
>>> Robin.
>>>
>>>> Thank you!
>>>>
>>>> Platform info:
>>>> Linux kernel version: 5.10
>>>> PCIE GEN4 x16
>>>>
>>>> Sincerely,
>>>> Kelly
>>>>