RE: [6.4-rc7][regression] slab-out-of-bounds in amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]

From: Zhu, Jiadong
Date: Wed Jun 21 2023 - 03:47:22 EST


[AMD Official Use Only - General]

Hi,

It is fixed by https://patchwork.freedesktop.org/patch/542647/?series=119384&rev=2

Could you check whether this patch is included?

Thanks,
Jiadong

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Mikhail Gavrilov
Sent: Wednesday, June 21, 2023 3:38 PM
To: Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; amd-gfx list <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Linux List Kernel Mailing <linux-kernel@xxxxxxxxxxxxxxx>
Subject: [6.4-rc7][regression] slab-out-of-bounds in amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]

Hi,
After commit 5b711e7f9c73e5ff44d6ac865711d9a05c2a0360 I see a KASAN bug report on every boot:

Backtrace:
[ 18.600551] ==================================================================
[ 18.600558] BUG: KASAN: slab-out-of-bounds in amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
[ 18.600943] Write of size 8 at addr ffff8881e4d3a098 by task kworker/8:1/133

[ 18.600952] CPU: 8 PID: 133 Comm: kworker/8:1 Tainted: G W L ------- --- 6.4.0-0.rc7.53.fc39.x86_64+debug #1
[ 18.600960] Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.331 02/24/2023
[ 18.600966] Workqueue: events amdgpu_device_delayed_init_work_handler [amdgpu]
[ 18.601253] Call Trace:
[ 18.601256] <TASK>
[ 18.601260] dump_stack_lvl+0x76/0xd0
[ 18.601267] print_report+0xcf/0x670
[ 18.601275] ? amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
[ 18.601573] ? amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
[ 18.601865] kasan_report+0xa8/0xe0
[ 18.601870] ? amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
[ 18.602163] amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
[ 18.602455] gfx_v9_0_ring_emit_ib_gfx+0x4cc/0xd50 [amdgpu]
[ 18.602767] ? amdgpu_sw_ring_ib_begin+0x1b4/0x3d0 [amdgpu]
[ 18.603061] amdgpu_ib_schedule+0x7cb/0x1570 [amdgpu]
[ 18.603354] gfx_v9_0_ring_test_ib+0x375/0x540 [amdgpu]
[ 18.603656] ? __pfx_gfx_v9_0_ring_test_ib+0x10/0x10 [amdgpu]
[ 18.603959] ? __pfx_lock_acquire+0x10/0x10
[ 18.603966] amdgpu_ib_ring_tests+0x2bc/0x490 [amdgpu]
[ 18.604260] amdgpu_device_delayed_init_work_handler+0x15/0x30 [amdgpu]
[ 18.604544] process_one_work+0x888/0x1460
[ 18.604551] ? worker_thread+0x2c8/0x12c0
[ 18.604555] ? __pfx_process_one_work+0x10/0x10
[ 18.604562] worker_thread+0x104/0x12c0
[ 18.604567] ? __kthread_parkme+0xc1/0x1f0
[ 18.604573] ? __pfx_worker_thread+0x10/0x10
[ 18.604577] kthread+0x2ee/0x3c0
[ 18.604581] ? __pfx_kthread+0x10/0x10
[ 18.604586] ret_from_fork+0x2c/0x50
[ 18.604593] </TASK>

[ 18.604598] Allocated by task 466:
[ 18.604601] kasan_save_stack+0x33/0x60
[ 18.604606] kasan_set_track+0x25/0x30
[ 18.604610] __kasan_kmalloc+0x8f/0xa0
[ 18.604614] __kmalloc+0x62/0x160
[ 18.604618] amdgpu_ring_mux_init+0x6e/0x1b0 [amdgpu]
[ 18.604905] gfx_v9_0_sw_init+0xffe/0x2930 [amdgpu]
[ 18.605197] amdgpu_device_init+0x3c36/0x7fc0 [amdgpu]
[ 18.605476] amdgpu_driver_load_kms+0x1d/0x4b0 [amdgpu]
[ 18.605753] amdgpu_pci_probe+0x279/0x9a0 [amdgpu]
[ 18.606029] local_pci_probe+0xdd/0x190
[ 18.606034] pci_device_probe+0x23a/0x770
[ 18.606039] really_probe+0x3e2/0xb80
[ 18.606044] __driver_probe_device+0x18c/0x450
[ 18.606048] driver_probe_device+0x4a/0x120
[ 18.606052] __driver_attach+0x1e5/0x4a0
[ 18.606056] bus_for_each_dev+0x109/0x190
[ 18.606061] bus_add_driver+0x2a1/0x570
[ 18.606064] driver_register+0x134/0x460
[ 18.606069] do_one_initcall+0xd5/0x3b0
[ 18.606073] do_init_module+0x238/0x770
[ 18.606079] load_module+0x5581/0x6f10
[ 18.606082] __do_sys_init_module+0x1f2/0x220
[ 18.606086] do_syscall_64+0x60/0x90
[ 18.606091] entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 18.606099] The buggy address belongs to the object at ffff8881e4d3a000 which belongs to the cache kmalloc-128 of size 128
[ 18.606106] The buggy address is located 24 bytes to the right of allocated 128-byte region [ffff8881e4d3a000, ffff8881e4d3a080)

[ 18.606115] The buggy address belongs to the physical page:
[ 18.606119] page:00000000024dbf3d refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1e4d3a
[ 18.606126] head:00000000024dbf3d order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 18.606132] flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
[ 18.606138] page_type: 0xffffffff()
[ 18.606143] raw: 0017ffffc0010200 ffff8881000428c0 dead000000000122 0000000000000000
[ 18.606148] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
[ 18.606153] page dumped because: kasan: bad access detected

[ 18.606159] Memory state around the buggy address:
[ 18.606162] ffff8881e4d39f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 18.606167] ffff8881e4d3a000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 18.606172] >ffff8881e4d3a080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 18.606176] ^
[ 18.606180] ffff8881e4d3a100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc
[ 18.606184] ffff8881e4d3a180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 18.606189] ==================================================================
[ 18.606201] Disabling lock debugging due to kernel taint
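
For readers checking the numbers in the report, here is a minimal sketch of the address arithmetic. It assumes generic KASAN, where one shadow byte covers 8 bytes of memory and each row of the "Memory state" dump holds 16 shadow bytes; under that assumption the fc bytes in the marked row are slab redzone. The helper name describe_access is hypothetical and only for illustration; the addresses are copied from the report.

```python
# Values copied from the KASAN report above.
ACCESS_ADDR = 0xffff8881e4d3a098   # "Write of size 8 at addr ..."
ACCESS_SIZE = 8
OBJ_START = 0xffff8881e4d3a000     # start of the kmalloc-128 object
OBJ_END = 0xffff8881e4d3a080       # exclusive end of the 128-byte region

# Assumption (generic KASAN): one shadow byte describes 8 bytes of memory,
# and each dump row shows 16 shadow bytes, i.e. 128 bytes of memory.
SHADOW_GRANULE = 8
ROW_BYTES = 16 * SHADOW_GRANULE


def describe_access(addr, size, start, end):
    """Hypothetical helper: locate an access relative to an allocated region."""
    row_offset = addr % ROW_BYTES
    return {
        "object_size": end - start,
        "offset_in_object": addr - start,
        "bytes_past_end": addr - end,
        "last_byte_past_end": addr + size - 1 - end,
        "shadow_row": addr - row_offset,
        "shadow_index": row_offset // SHADOW_GRANULE,
    }


info = describe_access(ACCESS_ADDR, ACCESS_SIZE, OBJ_START, OBJ_END)
for key, value in info.items():
    print(f"{key}: {value:#x}")
```

The computed bytes_past_end (0x18, i.e. 24) matches the "24 bytes to the right" line, and shadow_row matches the row marked with ">" in the dump.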

From bisect log:
5b711e7f9c73e5ff44d6ac865711d9a05c2a0360 is the first bad commit
commit 5b711e7f9c73e5ff44d6ac865711d9a05c2a0360
Author: Jiadong Zhu <Jiadong.Zhu@xxxxxxx>
Date: Thu May 25 18:42:15 2023 +0800

drm/amdgpu: Implement gfx9 patch functions for resubmission

Patch the packages including CONTEXT_CONTROL and WRITE_DATA for gfx9
during the resubmission scenario.

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@xxxxxxx>
Acked-by: Alex Deucher <alexander.deucher@xxxxxxx>
Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx # 6.3.x

drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 80 +++++++++++++++++++++++++++++++++++
1 file changed, 80 insertions(+)

This appears only on my laptop, an ASUS ROG Strix G15 Advantage Edition G513QY-HQ007 (Radeon 6800M).
I did not see the problem on desktops with a Radeon 7900XTX or a Radeon 6900XT.


Is there anything else I can help with?

--
Best Regards,
Mike Gavrilov.