Re: NULL pointer dereference in drm_dp_add_payload_part2+0xca/0x100

From: Jeff Layton
Date: Wed Apr 12 2023 - 10:02:11 EST


On Sat, 2023-04-08 at 07:46 -0400, Jeff Layton wrote:
> I've hit some repeated crashes in drm_dp_add_payload_part2. Here's one
> from this morning that occurred not long after booting the machine. I
> hadn't even logged in yet -- it was still at a gdm prompt:
>
> Apr 08 05:34:20 tleilax kernel: amdgpu 0000:30:00.0: [drm] Failed to create MST payload for port 0000000074d1d8eb: -5
> Apr 08 05:34:20 tleilax kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
> Apr 08 05:34:20 tleilax kernel: #PF: supervisor read access in kernel mode
> Apr 08 05:34:20 tleilax kernel: #PF: error_code(0x0000) - not-present page
> Apr 08 05:34:20 tleilax kernel: PGD 0 P4D 0
> Apr 08 05:34:20 tleilax kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
> Apr 08 05:34:20 tleilax kernel: CPU: 8 PID: 2278 Comm: gnome-shell Kdump: loaded Not tainted 6.2.9-200.fc37.x86_64 #1
> Apr 08 05:34:20 tleilax kernel: Hardware name: Micro-Star International Co., Ltd. MS-7A33/X370 SLI PLUS (MS-7A33), BIOS 3.JR 11/29/2019
> Apr 08 05:34:20 tleilax kernel: RIP: 0010:drm_dp_add_payload_part2+0xca/0x100 [drm_display_helper]
> Apr 08 05:34:20 tleilax kernel: Code: 8b 7e 08 44 89 e9 4c 89 c2 48 c7 c6 60 d2 55 c0 e8 ab 69 54 c5 44 89 e8 5b 5d 41 5c 41 5d e9 2d 73 a2 c5 48 8b 80 60 05 00 00 <48> 8b 76 08 4c 8b 40 60 48 85 f6 74 04 48 8b 76 08 4>
> Apr 08 05:34:20 tleilax kernel: RSP: 0018:ffffa4238a2db590 EFLAGS: 00010246
> Apr 08 05:34:20 tleilax kernel: RAX: ffff961550cac000 RBX: ffff961550cac000 RCX: ffffffffc055ca98
> Apr 08 05:34:20 tleilax kernel: RDX: ffff9615a6326140 RSI: 0000000000000000 RDI: ffff9615578a4568
> Apr 08 05:34:20 tleilax kernel: RBP: 0000000000000001 R08: 00000000fffffffb R09: 0000000000000000
> Apr 08 05:34:20 tleilax kernel: R10: 0000000000000002 R11: 0000000000000100 R12: ffff9615578a4000
> Apr 08 05:34:20 tleilax kernel: R13: ffff96154a5b8de0 R14: ffffffffc0d9d980 R15: ffff9615589c1f90
> Apr 08 05:34:20 tleilax kernel: FS: 00007f1c8ad775c0(0000) GS:ffff96241f000000(0000) knlGS:0000000000000000
> Apr 08 05:34:20 tleilax kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 08 05:34:20 tleilax kernel: CR2: 0000000000000008 CR3: 000000012f908000 CR4: 00000000003506e0
> Apr 08 05:34:20 tleilax kernel: Call Trace:
> Apr 08 05:34:20 tleilax kernel: <TASK>
> Apr 08 05:34:20 tleilax kernel: dm_helpers_dp_mst_send_payload_allocation+0x83/0xb0 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: dc_link_allocate_mst_payload+0x16d/0x280 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: core_link_enable_stream+0x8ec/0xa10 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: ? optc1_set_drr+0x136/0x1e0 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: dce110_apply_ctx_to_hw+0x61b/0x670 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: dc_commit_state_no_check+0x39b/0xcd0 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: dc_commit_state+0x107/0x120 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: amdgpu_dm_atomic_commit_tail+0x5bf/0x2d20 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: ? cpufreq_this_cpu_can_update+0x12/0x60
> Apr 08 05:34:20 tleilax kernel: ? sugov_get_util+0x7e/0x90
> Apr 08 05:34:20 tleilax kernel: ? sugov_update_single_freq+0xb7/0x180
> Apr 08 05:34:20 tleilax kernel: ? _raw_spin_lock+0x13/0x40
> Apr 08 05:34:20 tleilax kernel: ? raw_spin_rq_lock_nested+0x1e/0x70
> Apr 08 05:34:20 tleilax kernel: ? psi_group_change+0x168/0x400
> Apr 08 05:34:20 tleilax kernel: ? _raw_spin_unlock+0x15/0x30
> Apr 08 05:34:20 tleilax kernel: ? finish_task_switch.isra.0+0x9b/0x300
> Apr 08 05:34:20 tleilax kernel: ? __switch_to+0x106/0x410
> Apr 08 05:34:20 tleilax kernel: ? __schedule+0x3d4/0x13c0
> Apr 08 05:34:20 tleilax kernel: ? dma_resv_get_fences+0x11b/0x220
> Apr 08 05:34:20 tleilax kernel: ? get_nohz_timer_target+0x18/0x190
> Apr 08 05:34:20 tleilax kernel: ? lock_timer_base+0x61/0x80
> Apr 08 05:34:20 tleilax kernel: ? _raw_spin_unlock_irqrestore+0x23/0x40
> Apr 08 05:34:20 tleilax kernel: ? __mod_timer+0x29e/0x3d0
> Apr 08 05:34:20 tleilax kernel: ? preempt_count_add+0x6a/0xa0
> Apr 08 05:34:20 tleilax kernel: ? _raw_spin_lock_irq+0x19/0x40
> Apr 08 05:34:20 tleilax kernel: ? _raw_spin_unlock_irq+0x1b/0x40
> Apr 08 05:34:20 tleilax kernel: ? wait_for_completion_timeout+0x13a/0x170
> Apr 08 05:34:20 tleilax kernel: ? wait_for_completion_interruptible+0x135/0x1e0
> Apr 08 05:34:20 tleilax kernel: ? __pfx_dma_fence_default_wait_cb+0x10/0x10
> Apr 08 05:34:20 tleilax kernel: commit_tail+0x94/0x130
> Apr 08 05:34:20 tleilax kernel: drm_atomic_helper_commit+0x112/0x140
> Apr 08 05:34:20 tleilax kernel: drm_atomic_commit+0x96/0xc0
> Apr 08 05:34:20 tleilax kernel: ? __pfx___drm_printfn_info+0x10/0x10
> Apr 08 05:34:20 tleilax kernel: drm_mode_atomic_ioctl+0x959/0xb50
> Apr 08 05:34:20 tleilax kernel: ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
> Apr 08 05:34:20 tleilax kernel: drm_ioctl_kernel+0xc9/0x170
> Apr 08 05:34:20 tleilax kernel: drm_ioctl+0x22f/0x410
> Apr 08 05:34:20 tleilax kernel: ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
> Apr 08 05:34:20 tleilax kernel: amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: __x64_sys_ioctl+0x90/0xd0
> Apr 08 05:34:20 tleilax kernel: do_syscall_64+0x5b/0x80
> Apr 08 05:34:20 tleilax kernel: ? __x64_sys_ioctl+0xa8/0xd0
> Apr 08 05:34:20 tleilax kernel: ? syscall_exit_to_user_mode+0x17/0x40
> Apr 08 05:34:20 tleilax kernel: ? do_syscall_64+0x67/0x80
> Apr 08 05:34:20 tleilax kernel: ? sched_clock_cpu+0xb/0xc0
> Apr 08 05:34:20 tleilax kernel: ? __irq_exit_rcu+0x3d/0x140
> Apr 08 05:34:20 tleilax kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
> Apr 08 05:34:20 tleilax kernel: RIP: 0033:0x7f1c8e723d6f
> Apr 08 05:34:20 tleilax kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 0>
> Apr 08 05:34:20 tleilax kernel: RSP: 002b:00007ffea61067d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> Apr 08 05:34:20 tleilax kernel: RAX: ffffffffffffffda RBX: 00005571af410fb0 RCX: 00007f1c8e723d6f
> Apr 08 05:34:20 tleilax kernel: RDX: 00007ffea6106870 RSI: 00000000c03864bc RDI: 000000000000000a
> Apr 08 05:34:20 tleilax kernel: RBP: 00007ffea6106870 R08: 0000000000000011 R09: 0000000000000011
> Apr 08 05:34:20 tleilax kernel: R10: 00005571ae320010 R11: 0000000000000246 R12: 00000000c03864bc
> Apr 08 05:34:20 tleilax kernel: R13: 000000000000000a R14: 00005571ae6ff140 R15: 00005571b0261950
> Apr 08 05:34:20 tleilax kernel: </TASK>
> Apr 08 05:34:20 tleilax kernel: Modules linked in: rfcomm snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp nf_conntrack_netbi>
> Apr 08 05:34:20 tleilax kernel: videobuf2_memops rapl mxm_wmi videobuf2_v4l2 wmi_bmof snd_pcm k10temp rfkill pcspkr videobuf2_common i2c_piix4 snd_timer joydev videodev snd mc parport_pc soundcore parport gpio_amdpt g>
> Apr 08 05:34:20 tleilax kernel: CR2: 0000000000000008
> Apr 08 05:34:20 tleilax kernel: ---[ end trace 0000000000000000 ]---
>
> $ ./scripts/faddr2line --list /usr/lib/debug/lib/modules/6.2.9-200.fc37.x86_64/kernel/drivers/gpu/drm/display/drm_display_helper.ko.debug drm_dp_add_payload_part2+0xca/0x100
> drm_dp_add_payload_part2+0xca/0x100:
>
> drm_dp_add_payload_part2 at /usr/src/debug/kernel-6.2.9/linux-6.2.9-200.fc37.x86_64/drivers/gpu/drm/display/drm_dp_mst_topology.c:3407
> 3402 {
> 3403 int ret = 0;
> 3404
> 3405 /* Skip failed payloads */
> 3406 if (payload->vc_start_slot == -1) {
> > 3407< drm_dbg_kms(state->dev, "Part 1 of payload creation for %s failed, skipping part 2\n",
> 3408 payload->port->connector->name);
> 3409 return -EIO;
> 3410 }
> 3411
> 3412 ret = drm_dp_create_payload_step2(mgr, payload);
>
> Since %rsi is NULL and the ->dev field is 8 bytes into the struct, I'm
> guessing that means that "state" was NULL here.
>
> I'm assuming that the real bug is in the caller (and I'm happy to help
> track that down), but would it make sense to allow this function to
> gracefully handle a NULL state pointer? IOW something like this?
>
> drm_dbg_kms(state ? state->dev : NULL, "Part 1 of payload creation for %s failed, skipping part 2\n",
>
> I think that would at least prevent this problem from crashing the machine.
>

FWIW, I patched my kernel with the above, and it did seem to save the
box from crashing when this happened again:

[14357.953046] amdgpu 0000:30:00.0: [drm] Failed to create MST payload for port 000000006d3a3885: -5
[14358.025845] [drm] DM_MST: stopping TM on aconnector: 00000000ef1bcb79 [id: 86]
[14358.593779] [drm] DM_MST: starting TM on aconnector: 00000000ef1bcb79 [id: 86]

In this case, all of my windows got moved to the secondary monitor, but
the machine stayed up and running. I think this seems to mostly occur when
the display goes to sleep.
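
To spell out the guard in isolation, here is a minimal userspace sketch
of the "state ? state->dev : NULL" idea. The struct and function names
(sketch_device, sketch_state, sketch_state_dev, sketch_guard_demo) are
stand-ins made up for illustration, not the real DRM definitions, and
the layout only mimics a pointer-sized field sitting after a small
leading member. It only shows that the accessor no longer dereferences a
NULL state; it says nothing about why the caller passed NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real DRM structures. */
struct sketch_device {
	const char *name;
};

struct sketch_state {
	int refcount;			/* placeholder for whatever precedes dev */
	struct sketch_device *dev;	/* offset 8 on a typical LP64 build */
};

/* Return the device to log against, or NULL when there is no state. */
static inline struct sketch_device *
sketch_state_dev(const struct sketch_state *state)
{
	return state ? state->dev : NULL;
}

/* Exercise both the normal path and the NULL-state path. */
int sketch_guard_demo(void)
{
	struct sketch_device dev = { "card0" };
	struct sketch_state state = { 1, &dev };

	assert(sketch_state_dev(&state) == &dev);
	assert(sketch_state_dev(NULL) == NULL);	/* no dereference of NULL */
	return 0;
}
```

This is only a band-aid in the same spirit as the one-liner above: the
caller that hands in a NULL state would still need to be tracked down.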

--
Jeff Layton <jlayton@xxxxxxxxxx>