[BUG] gpu: drm: radeon: two possible deadlocks involving locking and waiting

From: Jia-Ju Bai
Date: Tue Feb 01 2022 - 02:40:32 EST


Hello,

My static analysis tool reports a possible deadlock in the radeon driver in Linux 5.16:

#BUG 1
radeon_dpm_change_power_state_locked()
  mutex_lock(&rdev->ring_lock); --> Line 1133 (Lock A)
  radeon_fence_wait_empty()
    radeon_fence_wait_seq_timeout()
      wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)

radeon_ring_backup()
  mutex_lock(&rdev->ring_lock); --> Line 289(Lock A)
  radeon_fence_count_emitted()
    radeon_fence_process()
      wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)

When radeon_dpm_change_power_state_locked() is executed, "Wait X" is performed by holding "Lock A". If radeon_ring_backup() is executed at this time, "Wake X" cannot be performed to wake up "Wait X" in radeon_dpm_change_power_state_locked(), because "Lock A" has been already hold by radeon_dpm_change_power_state_locked(), causing a possible deadlock.
I find that "Wait X" is performed with a timeout MAX_SCHEDULE_TIMEOUT, to relieve the possible deadlock; but I think this timeout can cause inefficient execution.

#BUG 2
radeon_ring_lock()
  mutex_lock(&rdev->ring_lock); --> Line 147 (Lock A)
  radeon_ring_alloc()
    radeon_fence_wait_next()
      radeon_fence_wait_seq_timeout()
        wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)

radeon_ring_backup()
  mutex_lock(&rdev->ring_lock); --> Line 289(Lock A)
  radeon_fence_count_emitted()
    radeon_fence_process()
      wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)

When radeon_ring_lock() is executed, "Wait X" is performed by holding "Lock A". If radeon_ring_backup() is executed at this time, "Wake X" cannot be performed to wake up "Wait X" in radeon_ring_lock(), because "Lock A" has been already hold by radeon_ring_lock(), causing a possible deadlock.
I find that "Wait X" is performed with a timeout MAX_SCHEDULE_TIMEOUT, to relieve the possible deadlock; but I think this timeout can cause inefficient execution.

I am not quite sure whether these possible problems are real and how to fix them if they are real.
Any feedback would be appreciated, thanks :)


Best wishes,
Jia-Ju Bai