Re: [linux-next][mainline/master] [IPR] [Function could be = "__mutex_lock_slowpath(lock)"]OOPs kernel crash while performing IPR test

From: Mohamed Khalfella
Date: Mon Jan 29 2024 - 14:23:59 EST


On 2023-08-27 13:56:14 +0530, Tasmiya Nalatwad wrote:
> Greetings,
>
> [linux-next][mainline/master] [IPR] [Function could be =
> "__mutex_lock_slowpath(lock)"]OOPs kernel crash while performing IPR test

Hello,

We hit this issue while testing the 6.6.9 LTS kernel, and I narrowed it down
to commit fcaa174a9c99 ("scsi/sg: don't grab scsi host module reference").
Not holding a reference to the scsi_device caused the last reference to
be dropped in sg_remove_sfp_usercontext(). This caused request_queue to
be set to NULL in scsi_device_dev_release(), and passing that NULL to
blk_trace_remove() caused this panic. More detail below.

The issue can be reproduced by having a userspace process hold the last
refcount on a device that has been removed.

# python3
>>> import os
>>> fd = os.open('/dev/sg22', os.O_RDONLY)
>>> # wait until the device is removed
>>> os.close(fd)
#

# echo 1 > /sys/bus/pci/devices/0000\:5e\:00.0/remove
# # Now run >>> os.close(fd) above
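
The two steps above can also be scripted. This is only a minimal sketch:
hold_then_close() is a hypothetical helper name, and /dev/sg22 is simply
the node used in the session above, so substitute the sg node of the
device being removed.

```python
import os


def hold_then_close(path, wait_for_removal):
    """Open `path` read-only, wait, then close the descriptor.

    `wait_for_removal` is any callable that returns once the device has
    been removed (for example, after the sysfs "remove" write shown above).
    The close() is the call that drops the last reference in this scenario.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        wait_for_removal()
    finally:
        os.close(fd)
    return fd


# Interactive use (assumed device node, taken from the session above):
#   hold_then_close('/dev/sg22',
#                   lambda: input('remove the device, then press Enter '))
```

Run it in one shell, remove the device from another, then press Enter
to trigger the close().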

python3-14739 53..... 3782240930us : sg_remove_sfp_kprobe: (sg_remove_sfp+0x0/0xa0 <ffffffff816dd5c0>) kref=0xffff88b047055320
python3-14739 53..... 3782240934us : <stack trace>
=> sg_remove_sfp+0x1/0xa0 <ffffffff816dd5c1>
=> sg_release+0xa2/0x100 <ffffffff816de5e2>
=> __fput+0xe9/0x280 <ffffffff812fcf79>
=> __x64_sys_close+0x39/0x80 <ffffffff812f58a9>
=> do_syscall_64+0x35/0x80 <ffffffff81b57485>
=> entry_SYSCALL_64_after_hwframe+0x46/0xb0 <ffffffff81c0006a>
kworker/-2357 53..... 3782240948us : scsi_device_dev_release_kprobe: (scsi_device_dev_release+0x0/0x2c0 <ffffffff816c0680>) device=0xffff88ac553a61c0
kworker/-2357 53..... 3782240951us : <stack trace>
=> scsi_device_dev_release+0x1/0x2c0 <ffffffff816c0681>
=> device_release+0x31/0x90 <ffffffff81662fc1>
=> kobject_put+0x6d/0x180 <ffffffff81b3527d>
=> scsi_device_put+0x20/0x30 <ffffffff816b1190>
=> sg_remove_sfp_usercontext+0xfb/0x190 <ffffffff816de73b>
=> process_one_work+0x133/0x2f0 <ffffffff810a5983>
=> worker_thread+0x2ec/0x400 <ffffffff810a6dbc>
=> kthread+0xe2/0x110 <ffffffff810aed42>
=> ret_from_fork+0x2d/0x50 <ffffffff8103ddad>
=> ret_from_fork_asm+0x11/0x20 <ffffffff810017d1>

python3-14739 was holding the last refcount. sg_remove_sfp() queued
sg_remove_sfp_usercontext() for execution. scsi_device_dev_release()
then set sdev->request_queue to NULL, causing the panic.

kworker/49:1-607 [049] ..... 519.002877: scsi_device_dev_release_kprobe: (scsi_device_dev_release+0x0/0x2c0 <ffffffff816c0680>) device=0xffff889d227bf1c0
kworker/49:1-607 [049] ..... 519.002882: <stack trace>
=> scsi_device_dev_release+0x1/0x2c0 <ffffffff816c0681>
=> device_release+0x31/0x90 <ffffffff81662fc1>
=> kobject_put+0x6d/0x180 <ffffffff81b3526d>
=> scsi_device_put+0x20/0x30 <ffffffff816b1190>
=> sg_device_destroy+0x2f/0xb0 <ffffffff816dc16f>
=> sg_remove_sfp_usercontext+0x133/0x190 <ffffffff816de763>
=> process_one_work+0x133/0x2f0 <ffffffff810a5983>
=> worker_thread+0x2ec/0x400 <ffffffff810a6dbc>
=> kthread+0xe2/0x110 <ffffffff810aed42>
=> ret_from_fork+0x2d/0x50 <ffffffff8103ddad>
=> ret_from_fork_asm+0x11/0x20 <ffffffff810017d1>

Reverting 80b6051085c5 ("scsi: sg: Fix checking return value of
blk_get_queue()") and fcaa174a9c99 ("scsi/sg: don't grab scsi host module
reference") fixed the problem. The stack trace above shows the last
refcount on the scsi_device being dropped from sg_device_destroy().
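
To make the ordering concrete, here is a toy userspace model in Python.
It is only a sketch under stated assumptions: ScsiDevice,
teardown_put_first() and teardown_put_last() are hypothetical names, the
real teardown has more steps, and the model assumes, as described above,
that in the failing case blk_trace_remove() reads the queue pointer after
the final put has already cleared it.

```python
class ScsiDevice:
    """Toy stand-in for struct scsi_device: a refcount and a queue pointer."""

    def __init__(self):
        self.refs = 1                  # the single remaining reference
        self.request_queue = object()  # any non-None value means "valid"

    def put(self):
        """Models scsi_device_put(): the release path runs at zero."""
        self.refs -= 1
        if self.refs == 0:
            # Models scsi_device_dev_release() setting request_queue to NULL.
            self.request_queue = None


def blk_trace_remove(q):
    """Models the consumer of the queue; None stands in for NULL."""
    if q is None:
        raise RuntimeError("NULL request_queue passed to blk_trace_remove()")


def teardown_put_first(sdev):
    """Failing order: last reference dropped before the queue is used."""
    sdev.put()
    blk_trace_remove(sdev.request_queue)


def teardown_put_last(sdev):
    """Working order: queue used first, last reference dropped afterwards."""
    blk_trace_remove(sdev.request_queue)
    sdev.put()
```

teardown_put_first() raises in the model, while teardown_put_last()
completes and leaves the refcount at zero.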