Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

From: Christian König
Date: Wed Apr 26 2023 - 07:50:54 EST


Sending that once more from my mailing list address since AMD internal servers are blocking the mail.

Regards,
Christian.

Am 26.04.23 um 13:48 schrieb Christian König:
WTF? I owe you a beer!

I fixed exactly that problem during the review of the cleanup patch, and because of that I didn't consider that the code might still be there.

It also explains why we don't see that in our testing.

@Mikhail can you test that patch with drm-misc-next?

Thanks,
Christian.

Am 26.04.23 um 04:00 schrieb Chen, Guchun:
After reviewing the whole history, I think the attached patch may fix your problem. Could you please give it a try?

Regards,
Guchun

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Mikhail Gavrilov
Sent: Tuesday, April 25, 2023 9:20 PM
To: Koenig, Christian <Christian.Koenig@xxxxxxx>
Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>;
 dri-devel <dri-devel@xxxxxxxxxxxxxxxxxxxxx>;
 amd-gfx list <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>;
 Linux List Kernel Mailing <linux-kernel@xxxxxxxxxxxxxxx>
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

On Thu, Apr 20, 2023 at 3:32 PM Mikhail Gavrilov
<mikhail.v.gavrilov@xxxxxxxxx> wrote:
It's important not to give up.
https://youtu.be/25zhHBGIHJ8 [40 min]
https://youtu.be/utnDR26eYBY [50 min]
https://youtu.be/DJQ_tiimW6g [12 min]
https://youtu.be/Y6AH1oJKivA [6 min]
Yes, the issue is fully reproducible, but from time to time it does
not happen on the first attempt.
I also uploaded other videos, which prove that the issue definitely
shows up if someone launches those games in turn.
Reproducing it is only a matter of time.

Anyway I didn't want you to spend so much time trying to reproduce it.
This monkey business suits me better than you.
It would be better if I could collect more useful info.
Christian,
Did you manage to reproduce the problem?

Over the weekend I ran into a slab-use-after-free in
amdgpu_vm_handle_moved.
I wasn't playing any games at the time.
The Xwayland process was affected, so it led to a desktop hang.

==================================================================
BUG: KASAN: slab-use-after-free in amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
Read of size 8 at addr ffff888295c66190 by task Xwayland:cs0/173185

CPU: 21 PID: 173185 Comm: Xwayland:cs0 Tainted: G        W L -------  --- 6.3.0-0.rc7.20230420gitcb0856346a60.59.fc39.x86_64+debug #1
Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4601 02/02/2023
Call Trace:
  <TASK>
  dump_stack_lvl+0x76/0xd0
  print_report+0xcf/0x670
  ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
  ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
  kasan_report+0xa8/0xe0
  ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
  amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
  amdgpu_cs_ioctl+0x2b7e/0x5630 [amdgpu]
  ? __pfx___lock_acquire+0x10/0x10
  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
  ? mark_lock+0x101/0x16e0
  ? __lock_acquire+0xe54/0x59f0
  ? __pfx_lock_release+0x10/0x10
  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
  drm_ioctl_kernel+0x1fc/0x3d0
  ? __pfx_drm_ioctl_kernel+0x10/0x10
  drm_ioctl+0x4c5/0xaa0
  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
  ? __pfx_drm_ioctl+0x10/0x10
  ? _raw_spin_unlock_irqrestore+0x66/0x80
  ? lockdep_hardirqs_on+0x81/0x110
  ? _raw_spin_unlock_irqrestore+0x4f/0x80
  amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
  __x64_sys_ioctl+0x131/0x1a0
  do_syscall_64+0x60/0x90
  ? do_syscall_64+0x6c/0x90
  ? lockdep_hardirqs_on+0x81/0x110
  ? do_syscall_64+0x6c/0x90
  ? lockdep_hardirqs_on+0x81/0x110
  ? do_syscall_64+0x6c/0x90
  ? lockdep_hardirqs_on+0x81/0x110
  ? do_syscall_64+0x6c/0x90
  ? lockdep_hardirqs_on+0x81/0x110
  entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7ffb71b0892d
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:00007ffb677fe840 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffb677fe9f8 RCX: 00007ffb71b0892d
RDX: 00007ffb677fe900 RSI: 00000000c0186444 RDI: 000000000000000d
RBP: 00007ffb677fe890 R08: 00007ffb677fea50 R09: 00007ffb677fe8e0
R10: 0000556c4611bec0 R11: 0000000000000246 R12: 00007ffb677fe900
R13: 00000000c0186444 R14: 000000000000000d R15: 00007ffb677fe9f8
</TASK>

Allocated by task 173181:
  kasan_save_stack+0x33/0x60
  kasan_set_track+0x25/0x30
  __kasan_kmalloc+0x8f/0xa0
  __kmalloc_node+0x65/0x160
  amdgpu_bo_create+0x31e/0xfb0 [amdgpu]
  amdgpu_bo_create_user+0xca/0x160 [amdgpu]
  amdgpu_gem_create_ioctl+0x398/0x980 [amdgpu]
  drm_ioctl_kernel+0x1fc/0x3d0
  drm_ioctl+0x4c5/0xaa0
  amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
  __x64_sys_ioctl+0x131/0x1a0
  do_syscall_64+0x60/0x90
  entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 173185:
  kasan_save_stack+0x33/0x60
  kasan_set_track+0x25/0x30
  kasan_save_free_info+0x2e/0x50
  __kasan_slab_free+0x10b/0x1a0
  slab_free_freelist_hook+0x11e/0x1d0
  __kmem_cache_free+0xc0/0x2e0
  ttm_bo_release+0x667/0x9e0 [ttm]
  amdgpu_bo_unref+0x35/0x70 [amdgpu]
  amdgpu_gem_object_free+0x73/0xb0 [amdgpu]
  drm_gem_handle_delete+0xe3/0x150
  drm_ioctl_kernel+0x1fc/0x3d0
  drm_ioctl+0x4c5/0xaa0
  amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
  __x64_sys_ioctl+0x131/0x1a0
  do_syscall_64+0x60/0x90
  entry_SYSCALL_64_after_hwframe+0x72/0xdc

Last potentially related work creation:
  kasan_save_stack+0x33/0x60
  __kasan_record_aux_stack+0x97/0xb0
  __call_rcu_common.constprop.0+0xf8/0x1af0
  drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
  dma_resv_reserve_fences+0x4dc/0x7f0
  ttm_eu_reserve_buffers+0x3f6/0x1190 [ttm]
  amdgpu_cs_ioctl+0x204d/0x5630 [amdgpu]
  drm_ioctl_kernel+0x1fc/0x3d0
  drm_ioctl+0x4c5/0xaa0
  amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
  __x64_sys_ioctl+0x131/0x1a0
  do_syscall_64+0x60/0x90
  entry_SYSCALL_64_after_hwframe+0x72/0xdc

Second to last potentially related work creation:
  kasan_save_stack+0x33/0x60
  __kasan_record_aux_stack+0x97/0xb0
  __call_rcu_common.constprop.0+0xf8/0x1af0
  drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
  amdgpu_ctx_add_fence+0x2b1/0x390 [amdgpu]
  amdgpu_cs_ioctl+0x44d0/0x5630 [amdgpu]
  drm_ioctl_kernel+0x1fc/0x3d0
  drm_ioctl+0x4c5/0xaa0
  amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
  __x64_sys_ioctl+0x131/0x1a0
  do_syscall_64+0x60/0x90
  entry_SYSCALL_64_after_hwframe+0x72/0xdc

The buggy address belongs to the object at ffff888295c66000
 which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 400 bytes inside of
 freed 1024-byte region [ffff888295c66000, ffff888295c66400)

The buggy address belongs to the physical page:
page:00000000125ffbe3 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x295c60
head:00000000125ffbe3 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
anon flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc0010200 ffff88810004cdc0 0000000000000000 dead000000000001
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff888295c66080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888295c66100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 >ffff888295c66180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                          ^
  ffff888295c66200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888295c66280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
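
The numbers in the report above can be cross-checked by hand. The following is only a minimal sketch of that arithmetic, assuming generic KASAN where one shadow byte covers 8 bytes of memory; the variable names are made up for illustration and are not kernel identifiers.

```python
# Values copied from the KASAN report above.
access_addr = 0xffff888295c66190  # "Read of size 8 at addr ..."
object_base = 0xffff888295c66000  # start of the freed kmalloc-1k object
object_size = 1024                # "freed 1024-byte region"
shadow_row = 0xffff888295c66180   # row holding the marker in "Memory state"

# Assumption: generic KASAN, one shadow byte per 8 bytes of memory.
GRANULE = 8

# Offset of the bad read inside the freed object.
offset = access_addr - object_base

# Index of the shadow byte within its row that the '^' marker points at.
shadow_index = (access_addr - shadow_row) // GRANULE

print(offset)                     # 400, matching "located 400 bytes inside"
print(0 <= offset < object_size)  # True: the read lands inside the freed object
print(shadow_index)               # 2: the third "fb" in the marked row
```

So the read hits the middle of a buffer object that amdgpu_bo_create allocated and that ttm_bo_release had already freed, which matches the "Allocated by" and "Freed by" stacks.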

--
Best Regards,
Mike Gavrilov.