NULL pointer dereference in drm_dp_add_payload_part2+0xca/0x100

From: Jeff Layton
Date: Sat Apr 08 2023 - 07:46:44 EST


I've hit some repeated crashes in drm_dp_add_payload_part2. Here's one
from this morning that occurred not long after booting the machine. I
hadn't even logged in yet -- it was still at a gdm prompt:

Apr 08 05:34:20 tleilax kernel: amdgpu 0000:30:00.0: [drm] Failed to create MST payload for port 0000000074d1d8eb: -5
Apr 08 05:34:20 tleilax kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Apr 08 05:34:20 tleilax kernel: #PF: supervisor read access in kernel mode
Apr 08 05:34:20 tleilax kernel: #PF: error_code(0x0000) - not-present page
Apr 08 05:34:20 tleilax kernel: PGD 0 P4D 0
Apr 08 05:34:20 tleilax kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Apr 08 05:34:20 tleilax kernel: CPU: 8 PID: 2278 Comm: gnome-shell Kdump: loaded Not tainted 6.2.9-200.fc37.x86_64 #1
Apr 08 05:34:20 tleilax kernel: Hardware name: Micro-Star International Co., Ltd. MS-7A33/X370 SLI PLUS (MS-7A33), BIOS 3.JR 11/29/2019
Apr 08 05:34:20 tleilax kernel: RIP: 0010:drm_dp_add_payload_part2+0xca/0x100 [drm_display_helper]
Apr 08 05:34:20 tleilax kernel: Code: 8b 7e 08 44 89 e9 4c 89 c2 48 c7 c6 60 d2 55 c0 e8 ab 69 54 c5 44 89 e8 5b 5d 41 5c 41 5d e9 2d 73 a2 c5 48 8b 80 60 05 00 00 <48> 8b 76 08 4c 8b 40 60 48 85 f6 74 04 48 8b 76 08 4>
Apr 08 05:34:20 tleilax kernel: RSP: 0018:ffffa4238a2db590 EFLAGS: 00010246
Apr 08 05:34:20 tleilax kernel: RAX: ffff961550cac000 RBX: ffff961550cac000 RCX: ffffffffc055ca98
Apr 08 05:34:20 tleilax kernel: RDX: ffff9615a6326140 RSI: 0000000000000000 RDI: ffff9615578a4568
Apr 08 05:34:20 tleilax kernel: RBP: 0000000000000001 R08: 00000000fffffffb R09: 0000000000000000
Apr 08 05:34:20 tleilax kernel: R10: 0000000000000002 R11: 0000000000000100 R12: ffff9615578a4000
Apr 08 05:34:20 tleilax kernel: R13: ffff96154a5b8de0 R14: ffffffffc0d9d980 R15: ffff9615589c1f90
Apr 08 05:34:20 tleilax kernel: FS: 00007f1c8ad775c0(0000) GS:ffff96241f000000(0000) knlGS:0000000000000000
Apr 08 05:34:20 tleilax kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 08 05:34:20 tleilax kernel: CR2: 0000000000000008 CR3: 000000012f908000 CR4: 00000000003506e0
Apr 08 05:34:20 tleilax kernel: Call Trace:
Apr 08 05:34:20 tleilax kernel: <TASK>
Apr 08 05:34:20 tleilax kernel: dm_helpers_dp_mst_send_payload_allocation+0x83/0xb0 [amdgpu]
Apr 08 05:34:20 tleilax kernel: dc_link_allocate_mst_payload+0x16d/0x280 [amdgpu]
Apr 08 05:34:20 tleilax kernel: core_link_enable_stream+0x8ec/0xa10 [amdgpu]
Apr 08 05:34:20 tleilax kernel: ? optc1_set_drr+0x136/0x1e0 [amdgpu]
Apr 08 05:34:20 tleilax kernel: dce110_apply_ctx_to_hw+0x61b/0x670 [amdgpu]
Apr 08 05:34:20 tleilax kernel: dc_commit_state_no_check+0x39b/0xcd0 [amdgpu]
Apr 08 05:34:20 tleilax kernel: dc_commit_state+0x107/0x120 [amdgpu]
Apr 08 05:34:20 tleilax kernel: amdgpu_dm_atomic_commit_tail+0x5bf/0x2d20 [amdgpu]
Apr 08 05:34:20 tleilax kernel: ? cpufreq_this_cpu_can_update+0x12/0x60
Apr 08 05:34:20 tleilax kernel: ? sugov_get_util+0x7e/0x90
Apr 08 05:34:20 tleilax kernel: ? sugov_update_single_freq+0xb7/0x180
Apr 08 05:34:20 tleilax kernel: ? _raw_spin_lock+0x13/0x40
Apr 08 05:34:20 tleilax kernel: ? raw_spin_rq_lock_nested+0x1e/0x70
Apr 08 05:34:20 tleilax kernel: ? psi_group_change+0x168/0x400
Apr 08 05:34:20 tleilax kernel: ? _raw_spin_unlock+0x15/0x30
Apr 08 05:34:20 tleilax kernel: ? finish_task_switch.isra.0+0x9b/0x300
Apr 08 05:34:20 tleilax kernel: ? __switch_to+0x106/0x410
Apr 08 05:34:20 tleilax kernel: ? __schedule+0x3d4/0x13c0
Apr 08 05:34:20 tleilax kernel: ? dma_resv_get_fences+0x11b/0x220
Apr 08 05:34:20 tleilax kernel: ? get_nohz_timer_target+0x18/0x190
Apr 08 05:34:20 tleilax kernel: ? lock_timer_base+0x61/0x80
Apr 08 05:34:20 tleilax kernel: ? _raw_spin_unlock_irqrestore+0x23/0x40
Apr 08 05:34:20 tleilax kernel: ? __mod_timer+0x29e/0x3d0
Apr 08 05:34:20 tleilax kernel: ? preempt_count_add+0x6a/0xa0
Apr 08 05:34:20 tleilax kernel: ? _raw_spin_lock_irq+0x19/0x40
Apr 08 05:34:20 tleilax kernel: ? _raw_spin_unlock_irq+0x1b/0x40
Apr 08 05:34:20 tleilax kernel: ? wait_for_completion_timeout+0x13a/0x170
Apr 08 05:34:20 tleilax kernel: ? wait_for_completion_interruptible+0x135/0x1e0
Apr 08 05:34:20 tleilax kernel: ? __pfx_dma_fence_default_wait_cb+0x10/0x10
Apr 08 05:34:20 tleilax kernel: commit_tail+0x94/0x130
Apr 08 05:34:20 tleilax kernel: drm_atomic_helper_commit+0x112/0x140
Apr 08 05:34:20 tleilax kernel: drm_atomic_commit+0x96/0xc0
Apr 08 05:34:20 tleilax kernel: ? __pfx___drm_printfn_info+0x10/0x10
Apr 08 05:34:20 tleilax kernel: drm_mode_atomic_ioctl+0x959/0xb50
Apr 08 05:34:20 tleilax kernel: ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
Apr 08 05:34:20 tleilax kernel: drm_ioctl_kernel+0xc9/0x170
Apr 08 05:34:20 tleilax kernel: drm_ioctl+0x22f/0x410
Apr 08 05:34:20 tleilax kernel: ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
Apr 08 05:34:20 tleilax kernel: amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
Apr 08 05:34:20 tleilax kernel: __x64_sys_ioctl+0x90/0xd0
Apr 08 05:34:20 tleilax kernel: do_syscall_64+0x5b/0x80
Apr 08 05:34:20 tleilax kernel: ? __x64_sys_ioctl+0xa8/0xd0
Apr 08 05:34:20 tleilax kernel: ? syscall_exit_to_user_mode+0x17/0x40
Apr 08 05:34:20 tleilax kernel: ? do_syscall_64+0x67/0x80
Apr 08 05:34:20 tleilax kernel: ? sched_clock_cpu+0xb/0xc0
Apr 08 05:34:20 tleilax kernel: ? __irq_exit_rcu+0x3d/0x140
Apr 08 05:34:20 tleilax kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
Apr 08 05:34:20 tleilax kernel: RIP: 0033:0x7f1c8e723d6f
Apr 08 05:34:20 tleilax kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 0>
Apr 08 05:34:20 tleilax kernel: RSP: 002b:00007ffea61067d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Apr 08 05:34:20 tleilax kernel: RAX: ffffffffffffffda RBX: 00005571af410fb0 RCX: 00007f1c8e723d6f
Apr 08 05:34:20 tleilax kernel: RDX: 00007ffea6106870 RSI: 00000000c03864bc RDI: 000000000000000a
Apr 08 05:34:20 tleilax kernel: RBP: 00007ffea6106870 R08: 0000000000000011 R09: 0000000000000011
Apr 08 05:34:20 tleilax kernel: R10: 00005571ae320010 R11: 0000000000000246 R12: 00000000c03864bc
Apr 08 05:34:20 tleilax kernel: R13: 000000000000000a R14: 00005571ae6ff140 R15: 00005571b0261950
Apr 08 05:34:20 tleilax kernel: </TASK>
Apr 08 05:34:20 tleilax kernel: Modules linked in: rfcomm snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp nf_conntrack_netbi>
Apr 08 05:34:20 tleilax kernel: videobuf2_memops rapl mxm_wmi videobuf2_v4l2 wmi_bmof snd_pcm k10temp rfkill pcspkr videobuf2_common i2c_piix4 snd_timer joydev videodev snd mc parport_pc soundcore parport gpio_amdpt g>
Apr 08 05:34:20 tleilax kernel: CR2: 0000000000000008
Apr 08 05:34:20 tleilax kernel: ---[ end trace 0000000000000000 ]---

$ ./scripts/faddr2line --list /usr/lib/debug/lib/modules/6.2.9-200.fc37.x86_64/kernel/drivers/gpu/drm/display/drm_display_helper.ko.debug drm_dp_add_payload_part2+0xca/0x100
drm_dp_add_payload_part2+0xca/0x100:

drm_dp_add_payload_part2 at /usr/src/debug/kernel-6.2.9/linux-6.2.9-200.fc37.x86_64/drivers/gpu/drm/display/drm_dp_mst_topology.c:3407
3402 {
3403 int ret = 0;
3404
3405 /* Skip failed payloads */
3406 if (payload->vc_start_slot == -1) {
>3407< drm_dbg_kms(state->dev, "Part 1 of payload creation for %s failed, skipping part 2\n",
3408 payload->port->connector->name);
3409 return -EIO;
3410 }
3411
3412 ret = drm_dp_create_payload_step2(mgr, payload);

%rsi is NULL, the faulting address in CR2 is 0x8, and the ->dev field is
8 bytes into the struct, so I'm guessing that "state" was NULL here.
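
To make that offset reasoning concrete, here is a small userspace sketch.
The mock_* structs are hypothetical stand-ins and not the real kernel
definitions. They only assume a 4-byte refcount followed by a pointer, and
an LP64 ABI such as x86_64, which is what would put ->dev at offset 8:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, NOT the real kernel definitions. */
struct mock_kref {
	int refcount;                 /* 4 bytes */
};

struct mock_drm_device {
	int unused;
};

struct mock_atomic_state {
	struct mock_kref ref;         /* followed by 4 bytes of padding on LP64 */
	struct mock_drm_device *dev;  /* pointer-aligned, so offset 8 on x86_64 */
};

/* Address of a member at 'offset' from 'base', computed without dereferencing. */
uintptr_t member_addr(uintptr_t base, size_t offset)
{
	return base + offset;
}

/* Address the CPU would try to read when evaluating state->dev. */
uintptr_t dev_fault_addr(uintptr_t state_ptr)
{
	return member_addr(state_ptr, offsetof(struct mock_atomic_state, dev));
}
```

With a NULL base, dev_fault_addr(0) gives the address that a read of
state->dev would touch under these assumptions, which can be compared
against the CR2 value in the oops.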

I'm assuming that the real bug is in the caller (and I'm happy to help
track that down), but would it make sense to have this function handle a
NULL state pointer gracefully? In other words, something like this?

drm_dbg_kms(state ? state->dev : NULL, "Part 1 of payload creation for %s failed, skipping part 2\n",
	    payload->port->connector->name);

I think that would at least prevent this problem from crashing the machine.
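
As a rough illustration of that guard outside the kernel, here is a
userspace sketch. Every mock_* name is hypothetical, and mock_dbg() only
stands in for a logging helper that tolerates a NULL device. It is not the
real drm_dbg_kms() or drm_dp_add_payload_part2():

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration only. */
struct mock_dev {
	const char *name;
};

struct mock_state {
	int ref;
	struct mock_dev *dev;
};

struct mock_payload {
	int vc_start_slot;
	const char *connector_name;
};

/* Logging helper that accepts a NULL device instead of dereferencing it. */
int mock_dbg(const struct mock_dev *dev, const char *msg, const char *arg)
{
	return fprintf(stderr, "[%s] %s %s\n",
		       dev ? dev->name : "no device", msg, arg);
}

/* Skips failed payloads and survives a NULL state pointer. */
int mock_add_payload_part2(const struct mock_state *state,
			   const struct mock_payload *payload)
{
	if (payload->vc_start_slot == -1) {
		/* The guard: never read state->dev through a NULL state. */
		mock_dbg(state ? state->dev : NULL,
			 "part 1 of payload creation failed, skipping part 2 for",
			 payload->connector_name);
		return -EIO;
	}
	return 0;
}
```

Calling mock_add_payload_part2(NULL, ...) with vc_start_slot set to -1
takes the error path and returns -EIO instead of faulting. It only hides
the symptom, so the caller that passed the NULL state would still need to
be tracked down.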

Thanks,
--
Jeff Layton <jlayton@xxxxxxxxxx>