Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled

From: Felix Kuehling
Date: Thu Jan 05 2023 - 10:27:36 EST


On 2023-01-05 at 09:46, Deucher, Alexander wrote:

-----Original Message-----
From: Hegde, Vasant <Vasant.Hegde@xxxxxxx>
Sent: Thursday, January 5, 2023 5:46 AM
To: Baolu Lu <baolu.lu@xxxxxxxxxxxxxxx>; Matt Fagnani
<matt.fagnani@xxxxxxxx>; Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>;
Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Joerg Roedel
<jroedel@xxxxxxx>
Cc: iommu@xxxxxxxxxxxxxxx; LKML <linux-kernel@xxxxxxxxxxxxxxx>;
regressions@xxxxxxxxxxxxxxx; Linux PCI <linux-pci@xxxxxxxxxxxxxxx>; Bjorn
Helgaas <bhelgaas@xxxxxxxxxx>
Subject: Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen
when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled

Baolu,


On 1/5/2023 4:07 PM, Baolu Lu wrote:
On 2023/1/5 18:27, Vasant Hegde wrote:
On 1/5/2023 6:39 AM, Matt Fagnani wrote:
I built 6.2-rc2 with the patch applied. The same black screen
problem happened with 6.2-rc2 with the patch. I tried to use early
kdump with 6.2-rc2 with the patch twice by panicking the kernel with
sysrq+alt+c after the black screen happened. The system rebooted
after about 10-20 seconds both times, but no kdump and dmesg files
were saved in /var/crash. I'm attaching the lspci -vvv output as
requested.
Thanks for testing. As mentioned earlier, I was not expecting this
patch to fix the black screen issue. It should fix the kernel warnings
and the IOMMU page-fault-related call traces. By any chance, do you have
the kernel boot logs?
@Baolu,
   Looking at the lspci output, it doesn't list the ACS capability for the
graphics card. So with your fix, PASID was not enabled, and hence it failed to
boot.
So do you mind telling me why PASID needs to be enabled for the
graphics device? Or in other words, what does the graphics driver use
the PASID for?
Honestly, I don't know the complete details of how PASID works with the
graphics card. Maybe Alex or Joerg can explain it better.
+ Felix

The GPU driver uses the PASID for shared virtual memory between the CPU and GPU, i.e., so that user apps can use the same virtual address space on the GPU and the CPU. It also uses the PASID to take advantage of recoverable device page faults using PRS.

Agreed. This applies to GPU computing on some older AMD APUs that take advantage of memory coherence and IOMMUv2 address translation to create a shared virtual address space between the CPU and GPU. In this case it seems to be a Carrizo APU. It is also true for Raven APUs.
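
For anyone who wants to repeat the lspci check mentioned earlier in the thread, here is a rough userspace sketch that scans saved `lspci -vvv` output and reports which of the ACS, ATS, PRI and PASID capabilities each device lists. The capability strings are assumed to match what pciutils prints, and the function and variable names are made up for illustration. It only reports what lspci shows, not what the kernel decided to enable.

```python
import re

# Capability names as printed by `lspci -vvv` (assumed spellings; adjust
# if your pciutils version words them differently).
CAPABILITY_MARKERS = {
    "ACS": "Access Control Services",
    "ATS": "Address Translation Service",
    "PRI": "Page Request Interface",
    "PASID": "Process Address Space ID",
}

# A device block starts with an unindented address such as "00:01.0",
# optionally prefixed with a PCI domain such as "0000:".
DEVICE_HEADER = re.compile(
    r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s"
)


def capabilities_by_device(lspci_text):
    """Map each device address to the set of capabilities of interest it lists."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_HEADER.match(line)
        if header:
            current = header.group(1)
            result[current] = set()
            continue
        # Only indented "Capabilities:" lines inside a device block matter.
        if current is None or "Capabilities:" not in line:
            continue
        for short_name, marker in CAPABILITY_MARKERS.items():
            if marker in line:
                result[current].add(short_name)
    return result


def missing_acs_with_pasid(lspci_text):
    """Return addresses of devices that list PASID but not ACS."""
    return sorted(
        address
        for address, caps in capabilities_by_device(lspci_text).items()
        if "PASID" in caps and "ACS" not in caps
    )
```

Feeding it the contents of a saved `lspci -vvv` dump (run as root so the extended capabilities are visible) and calling `missing_acs_with_pasid()` lists devices that advertise PASID without ACS, which is the combination discussed above.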

Regards,
  Felix



Alex