RE: dma_map_resource() has poor performance in PCIe peer-to-peer transactions when IOMMU is enabled in Linux

From: Kelly Devilliv
Date: Tue Sep 26 2023 - 11:31:10 EST


On 2023-09-26 13:32, Christian König wrote:
> Am 26.09.23 um 06:33 schrieb Kelly Devilliv:
>> On 2023-09-26 01:58, Christian König wrote:
>>> Am 25.09.23 um 16:17 schrieb Kelly Devilliv:
>>>> On 2023-09-25 19:16, Robin Murphy wrote:
>>>>> On 2023-09-25 04:59, Kelly Devilliv wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> I am working on an ARMv8 server with two GPU cards on it.
>>>>>> Recently I needed to test PCIe peer-to-peer communication between
>>>>>> the two GPU cards, but the throughput was only 4 GB/s.
>>>>>> After exploring the GPU's kernel-mode driver, I found it was
>>>>>> using the dma_map_resource() API to map the peer device's MMIO
>>>>>> space. The Arm IOMMU driver then hardcodes an 'IOMMU_MMIO' prot in the subsequent DMA mapping:
>>>>>> static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
>>>>>>                 size_t size, enum dma_data_direction dir, unsigned long attrs)
>>>>>> {
>>>>>>         return __iommu_dma_map(dev, phys, size,
>>>>>>                         dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO,
>>>>>>                         dma_get_mask(dev));
>>>>>> }
>>>>>>
>>>>>> That will ultimately set the 'ARM_LPAE_PTE_MEMATTR_DEV' attribute
>>>>>> in the PTE, which may hurt the performance of PCIe peer-to-peer
>>>>>> transactions:
>>>>>> /*
>>>>>>  * Note that this logic is structured to accommodate Mali LPAE
>>>>>>  * having stage-1-like attributes but stage-2-like permissions.
>>>>>>  */
>>>>>> if (data->iop.fmt == ARM_64_LPAE_S2 ||
>>>>>>     data->iop.fmt == ARM_32_LPAE_S2) {
>>>>>>         if (prot & IOMMU_MMIO)
>>>>>>                 pte |= ARM_LPAE_PTE_MEMATTR_DEV;
>>>>>>         else if (prot & IOMMU_CACHE)
>>>>>>                 pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
>>>>>>         else
>>>>>>                 pte |= ARM_LPAE_PTE_MEMATTR_NC;
>>>>>> } else {
>>>>>>         if (prot & IOMMU_MMIO)
>>>>>>                 pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
>>>>>>                         << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>>>>>         else if (prot & IOMMU_CACHE)
>>>>>>                 pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
>>>>>>                         << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>>>>> }
>>>>>>
>>>>>> I tried removing the 'IOMMU_MMIO' prot from dma_map_resource()
>>>>>> and recompiling the Linux kernel; the throughput then reached
>>>>>> 28 GB/s.
>>>>>> Is there an elegant way to solve this issue without modifying the Linux kernel,
>>>>>> e.g. a substitute for the dma_map_resource() API?
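
(For context, the driver side looks roughly like the sketch below when it maps
the peer GPU's BAR with dma_map_resource(); the device and BAR variables are
placeholders, not the actual driver code.)

        /* Map the peer GPU's BAR so the local GPU can DMA into it.
         * peer_pdev, local_pdev and bar are illustrative; error handling trimmed. */
        phys_addr_t bar_phys = pci_resource_start(peer_pdev, bar);
        size_t bar_size = pci_resource_len(peer_pdev, bar);
        dma_addr_t bar_dma;

        bar_dma = dma_map_resource(&local_pdev->dev, bar_phys, bar_size,
                                   DMA_BIDIRECTIONAL, 0);
        if (dma_mapping_error(&local_pdev->dev, bar_dma))
                return -ENOMEM;

        /* bar_dma is handed to the local GPU as the P2P target address; with
         * the SMMU enabled this IOVA ends up mapped with IOMMU_MMIO, i.e.
         * Device-nGnRE, which is where the throughput drop comes from. */
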
>>>>> Not really. Other use-cases for dma_map_resource() include DMA
>>>>> offload engines accessing FIFO registers, where allowing
>>>>> reordering, write-gathering, etc. would be a terrible idea. Thus it
>>>>> needs to assume a "safe" MMIO memory type, which on Arm means Device-nGnRE.
>>>>>
>>>>> However, the "proper" PCI peer-to-peer support under
>>>>> CONFIG_PCI_P2PDMA ended up moving away from the dma_map_resource()
>>>>> approach anyway, and allows this kind of device memory to be
>>>>> treated more like regular memory (via
>>>>> ZONE_DEVICE) rather than arbitrary MMIO resources, so your best bet
>>>>> would be to get the GPU driver converted over to using that.
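
If I understand the CONFIG_PCI_P2PDMA route correctly, the exporting GPU would
publish its BAR as P2P memory and the importer would allocate from it, roughly
as sketched below (assuming the BAR can be registered; the names are
placeholders and error handling is trimmed):

        #include <linux/pci-p2pdma.h>

        /* Exporting GPU: publish (part of) its BAR as P2P DMA memory. */
        ret = pci_p2pdma_add_resource(peer_pdev, bar, bar_size, 0);
        if (ret)
                return ret;

        /* Importer: carve a chunk out of that BAR and get its bus address. */
        void *p2p_va = pci_alloc_p2pmem(peer_pdev, chunk_size);
        pci_bus_addr_t p2p_bus = pci_p2pmem_virt_to_bus(peer_pdev, p2p_va);

        /* p2p_bus can then be programmed into the local GPU's DMA engine;
         * pci_free_p2pmem(peer_pdev, p2p_va, chunk_size) releases the chunk. */
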
>>>> Thanks, Robin.
>>>> So your suggestion is that we should work out a new implementation
>>>> along the lines of CONFIG_PCI_P2PDMA instead of just using the
>>>> dma_map_resource() API?
>>>>
>>>> I have explored the GPU drivers from AMD, Nvidia and habanalabs,
>>>> for example, and found that they all use the dma_map_resource() API to
>>>> map the peer device's BAR address.
>>>> If so, could this be a common performance issue in PCI
>>>> peer-to-peer scenarios?
>>> That's not an issue, but expected behavior.
>>>
>>> When you enable the IOMMU, every transaction needs to go through the root
>>> complex for address translation, and you completely lose the
>>> performance benefit of PCIe P2P.
>> Thanks Christian. That's true.
>>
>>> This is a hardware limitation and not really related to
>>> dma_map_resource() in any way.
>>>
>> But when I removed the 'IOMMU_MMIO' prot in dma_map_resource(), the
>> performance improved significantly (from 4 GB/s to 28 GB/s), almost
>> the same as with the IOMMU disabled. So I suspect that in my common PCI topology,
>> what really matters is not whether the IOMMU is enabled, but rather the attributes in the DMA mapping, i.e. in the Arm PTE.
>
> The key point is that nobody really supports that configuration, so you will
> probably find nobody looking into it.
>
> BTW: ARM isn't really supported as a platform for amdgpu either. E.g. we have
> seen tons of boards which implement the PCIe standard incorrectly; if you run
> into any trouble with that, you are pretty much on your own.
>

Thanks, Christian. I am going to disable the IOMMU or apply some workarounds in the PCI peer-to-peer scenario.
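
Concretely, the workaround I experimented with is the local change below in
drivers/iommu/dma-iommu.c (Linux 5.10). I understand it is not a general fix,
since other dma_map_resource() users (e.g. DMA engines writing to FIFO
registers) rely on the Device memory type:

        /* Local experiment only: drop IOMMU_MMIO so the peer BAR is no longer
         * forced to Device-nGnRE in the SMMU page tables. */
        static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
                        size_t size, enum dma_data_direction dir, unsigned long attrs)
        {
                return __iommu_dma_map(dev, phys, size,
                                dma_info_to_prot(dir, false, attrs), /* was: ... | IOMMU_MMIO */
                                dma_get_mask(dev));
        }
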

>> I don't know whether there is a way to make the memory attributes more configurable
>> so they can be distinguished from the "safe" MMIO memory type, which on Arm means Device-nGnRE, as Robin said.
>
> Well, we would need to extend dma_map_resource() to take some indication of the use case
> so that the mapping attributes don't need to be guessed.
>

Hi Robin,

Is there any chance to extend the dma_map_resource() API as discussed above?
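
Just to make the question concrete, I could imagine something along the lines
of a new mapping attribute; the flag below is purely hypothetical and does not
exist in the kernel today:

        /* Hypothetical only: the name and bit are made up for illustration.
         * The idea is to let callers that map prefetchable, write-combinable
         * peer memory (e.g. a GPU BAR) opt out of the Device-nGnRE default. */
        #define DMA_ATTR_MMIO_PREFETCHABLE      (1UL << 10)

        bar_dma = dma_map_resource(dev, bar_phys, bar_size,
                                   DMA_BIDIRECTIONAL, DMA_ATTR_MMIO_PREFETCHABLE);

so that iommu_dma_map_resource() could choose a more relaxed memory type than
Device-nGnRE when the caller explicitly asks for it.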

Sincerely,
Kelly

> Regards,
> Christian.
>
>>
>> Sincerely,
>> Kelly
>>
>>> Regards,
>>> Christian.
>>>
>>>>> Thanks,
>>>>> Robin.
>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Platform info:
>>>>>> Linux kernel version: 5.10
>>>>>> PCIe Gen4 x16
>>>>>>
>>>>>> Sincerely,
>>>>>> Kelly
>>>>>>