Re: drm/msm: Second DisplayPort regression in 6.8-rc1

From: Abhinav Kumar
Date: Tue Feb 20 2024 - 16:20:31 EST


Hi Johan

On 2/19/2024 2:41 AM, Johan Hovold wrote:
On Sat, Feb 17, 2024 at 04:14:58PM +0100, Johan Hovold wrote:
On Wed, Feb 14, 2024 at 02:52:06PM +0100, Johan Hovold wrote:
On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:

Since Dmitry had trouble reproducing this issue I took a closer look at
the DRM aux bridge series that Abhinav pointed and was able to track
down the bridge regressions and come up with a reproducer. I just posted
a series fixing this here:

https://lore.kernel.org/lkml/20240217150228.5788-1-johan+linaro@xxxxxxxxxx/

As I mentioned in the cover letter, I am still seeing intermittent hard
resets around the time that the DRM subsystem is initialising, which
suggests that we may be dealing with two separate DRM regressions here
however.

If the hard resets are triggered by something like unclocked hardware,
perhaps that bit could this be related to the runtime PM rework?

It seems my initial suspicion that at least some of these regressions
were related to the runtime PM work was correct. The hard resets happens
when the DP controller is runtime suspended after being probed:

[ 16.748475] bus: 'platform': __driver_probe_device: matched device ae00000.display-subsystem with driver msm-mdss
[ 16.759444] msm-mdss ae00000.display-subsystem: Adding to iommu group 21
[ 16.795226] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu
[ 16.807542] probe of ae01000.display-controller returned -517 after 3 usecs
[ 16.821552] bus: 'platform': __driver_probe_device: matched device ae90000.displayport-controller with driver msm-dp-display
[ 16.837749] probe of ae90000.displayport-controller returned -517 after 1 usecs
[ OK ] Listening on Load/Save RF Kill Swit[ 16.854659] bus: 'platform': __dch Status /dev/rfkill Watch.
[ 16.868458] probe of ae98000.displayport-controller returned -517 after 2 usecs
[ 16.880012] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
[ 16.891856] probe of aea0000.displayport-controller returned -517 after 2 usecs
[ 16.903825] probe of ae00000.display-subsystem returned 0 after 144497 usecs
[ 16.911636] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu
[ 16.942092] probe of ae01000.display-controller returned 0 after 19593 usecs
Starting Load/Save Screen Backligh…rightness[ 16.959146] bus: 'platform': _ of backlight:backlight...
[ 16.995355] msm-dp-display ae90000.displayport-controller: dp_display_probe - probe tail
[ 17.004032] probe of ae90000.displayport-controller returned 0 after 30225 usecs
[ 17.012308] bus: 'platform': __driver_probe_device: matched device ae98000.displayport-controller with driver msm-dp-display
[ 17.050193] msm-dp-display ae98000.displayport-controller: dp_display_probe - probe tail
Starting Network Name Resolution...
[ 17.058925] probe of ae98000.displayport-controller returned 0 after 34774 usecs
[ 17.074925] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
[ Starting Network Time Synchronization...
[ 17.112000] msm-dp-display aea0000.displayport-controller: dp_display_probe - populate aux bus
[ 17.125208] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_resume
Starting Record System Boot/Shutdown in UTMP...
Starting Virtual Console Setup...
[ OK ] Finished Load/Save Screen Backlight Brightness of backlight:backlight.
[ 17.197909] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_suspend
[ 17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - Optional Info
Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic
S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
S - IMAGE_VARIANT_STRING=SocMakenaWP
S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92

< machine is reset by hypervisor >

Presumably the reset happens when controller is being shut down while
still being used by the EFI framebuffer.


I am not sure if we can conclude like that. Even if we shut off the controller when the framebuffer was still being fetched that should only cause a blank screen and not a reset because we really don't trigger a new register write / read while its fetching so as such there is no new hardware access.

One thing I must accept is that there are two differences between sc8280xp where we are hitting these resets and sc7180/sc7280 chromebooks where we tested it more thoroughly without any such issues:

1) with the chromebooks we have depthcharge and not the QC UEFI.

If we are suspecting a hand-off issue here, will it help if we try to disable the display in EFI by using "fastboot oem select-display-panel none" (assuming this is a fastboot enabled device) and see if you still hit the reset issue?

2) chromebooks used "internal_hpd" whereas the pmic_glink method used in the sc8280xp.

I am still checking if there are any code paths in the eDP/DP driver left exposed due to this difference with pm_runtime which can cause this. I am wondering if some sort of drm tracing will help to narrow down the reset point.

In the cases where the machines survives boot, the controller is never
suspended.

When investigating this I've also seen intermittent:

[drm:dp_display_probe [msm]] *ERROR* device tree parsing failed


So this error I think is because in dp_parser_parse() ---> dp_parser_ctrl_res(), we also have a devm_phy_get().

This can return -EDEFER if the phy driver has not yet probed.

I checked the other things inside dp_parser_parse(), others calls seem to be purely DT parsing except this one. I think to avoid the confusion, we should move devm_phy_get() outside of DT parsing into a separate call or atleast add an error log inside devm_phy_get() failure below to indicate that it deferred

io->phy = devm_phy_get(&pdev->dev, "dp");
if (IS_ERR(io->phy))
return PTR_ERR(io->phy);

If my hypothesis is correct on this, then this error log (even though misleading) should be harmless for this issue because if we hit DRM_ERROR("device tree parsing failed\n"); we will skip the devm_pm_runtime_enable().

which also appears to be related to the runtime PM rework:

https://lore.kernel.org/lkml/1701472789-25951-1-git-send-email-quic_khsieh@xxxxxxxxxxx/

I believe this is enough evidence to conclude that this second
regression is introduced by commit 5814b8bf086a ("drm/msm/dp:
incorporate pm_runtime framework into DP driver"):

#regzbot introduced: 5814b8bf086a

Has anyone given some thought to how the framebuffer handover is
supposed to work? It seems we're currently just relying on luck with
timing.



Johan