[PATCH AUTOSEL 6.6 20/31] drm/amdkfd: Fix lock dependency warning with srcu

From: Sasha Levin
Date: Sun Jan 28 2024 - 11:27:29 EST


From: Philip Yang <Philip.Yang@xxxxxxx>

[ Upstream commit 2a9de42e8d3c82c6990d226198602be44f43f340 ]

======================================================
WARNING: possible circular locking dependency detected
6.5.0-kfd-yangp #2289 Not tainted
------------------------------------------------------
kworker/0:2/996 is trying to acquire lock:
(srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x5/0x1a0

but task is already holding lock:
((work_completion)(&svms->deferred_list_work)){+.+.}-{0:0}, at:
process_one_work+0x211/0x560

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 ((work_completion)(&svms->deferred_list_work)){+.+.}-{0:0}:
__flush_work+0x88/0x4f0
svm_range_list_lock_and_flush_work+0x3d/0x110 [amdgpu]
svm_range_set_attr+0xd6/0x14c0 [amdgpu]
kfd_ioctl+0x1d1/0x630 [amdgpu]
__x64_sys_ioctl+0x88/0xc0

-> #2 (&info->lock#2){+.+.}-{3:3}:
__mutex_lock+0x99/0xc70
amdgpu_amdkfd_gpuvm_restore_process_bos+0x54/0x740 [amdgpu]
restore_process_helper+0x22/0x80 [amdgpu]
restore_process_worker+0x2d/0xa0 [amdgpu]
process_one_work+0x29b/0x560
worker_thread+0x3d/0x3d0

-> #1 ((work_completion)(&(&process->restore_work)->work)){+.+.}-{0:0}:
__flush_work+0x88/0x4f0
__cancel_work_timer+0x12c/0x1c0
kfd_process_notifier_release_internal+0x37/0x1f0 [amdgpu]
__mmu_notifier_release+0xad/0x240
exit_mmap+0x6a/0x3a0
mmput+0x6a/0x120
do_exit+0x322/0xb90
do_group_exit+0x37/0xa0
__x64_sys_exit_group+0x18/0x20
do_syscall_64+0x38/0x80

-> #0 (srcu){.+.+}-{0:0}:
__lock_acquire+0x1521/0x2510
lock_sync+0x5f/0x90
__synchronize_srcu+0x4f/0x1a0
__mmu_notifier_release+0x128/0x240
exit_mmap+0x6a/0x3a0
mmput+0x6a/0x120
svm_range_deferred_list_work+0x19f/0x350 [amdgpu]
process_one_work+0x29b/0x560
worker_thread+0x3d/0x3d0

other info that might help us debug this:
Chain exists of:
srcu --> &info->lock#2 --> (work_completion)(&svms->deferred_list_work)

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock((work_completion)(&svms->deferred_list_work));
lock(&info->lock#2);
lock((work_completion)(&svms->deferred_list_work));
sync(srcu);

Signed-off-by: Philip Yang <Philip.Yang@xxxxxxx>
Reviewed-by: Felix Kuehling <felix.kuehling@xxxxxxx>
Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
---
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index a4c911fa1675..b51224a85a38 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2343,8 +2343,10 @@ static void svm_range_deferred_list_work(struct work_struct *work)
mutex_unlock(&svms->lock);
mmap_write_unlock(mm);

- /* Pairs with mmget in svm_range_add_list_work */
- mmput(mm);
+ /* Pairs with mmget in svm_range_add_list_work. If dropping the
+ * last mm refcount, schedule release work to avoid circular locking
+ */
+ mmput_async(mm);

spin_lock(&svms->deferred_list_lock);
}
--
2.43.0