Re: [PATCH] gpu: drm: remove redundant dma_fence_put() when drm_sched_job_add_dependency() fails

From: Hangyu Hua
Date: Thu Apr 28 2022 - 04:56:28 EST


On 2022/4/27 22:43, Andrey Grodzovsky wrote:

On 2022-04-26 22:31, Hangyu Hua wrote:
On 2022/4/26 22:55, Andrey Grodzovsky wrote:

On 2022-04-25 22:54, Hangyu Hua wrote:
On 2022/4/25 23:42, Andrey Grodzovsky wrote:
On 2022-04-25 04:36, Hangyu Hua wrote:

When drm_sched_job_add_dependency() fails, dma_fence_put() will be called
internally. Calling it again after drm_sched_job_add_dependency() finishes
may result in a dangling pointer.

Fix this by removing redundant dma_fence_put().

Signed-off-by: Hangyu Hua <hbh25y@xxxxxxxxx>
---
  drivers/gpu/drm/lima/lima_gem.c        | 1 -
  drivers/gpu/drm/scheduler/sched_main.c | 1 -
  2 files changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/lima/lima_gem.c b/drivers/gpu/drm/lima/lima_gem.c
index 55bb1ec3c4f7..99c8e7f6bb1c 100644
--- a/drivers/gpu/drm/lima/lima_gem.c
+++ b/drivers/gpu/drm/lima/lima_gem.c
@@ -291,7 +291,6 @@ static int lima_gem_add_deps(struct drm_file *file, struct lima_submit *submit)
          err = drm_sched_job_add_dependency(&submit->task->base, fence);
          if (err) {
-            dma_fence_put(fence);
              return err;


Makes sense here


          }
      }
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index b81fceb0b8a2..ebab9eca37a8 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -708,7 +708,6 @@ int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
          dma_fence_get(fence);
          ret = drm_sched_job_add_dependency(job, fence);
          if (ret) {
-            dma_fence_put(fence);



Not sure about this one since if you look at the relevant commits -
'drm/scheduler: fix drm_sched_job_add_implicit_dependencies' and
'drm/scheduler: fix drm_sched_job_add_implicit_dependencies harder'
You will see that the dma_fence_put here balances the extra dma_fence_get
above

Andrey


I don't think so. I checked the call chain and found no additional dma_fence_get(). But dma_fence_get() needs to be called before drm_sched_job_add_dependency() to keep the counter balanced.


I'm not saying there is an additional get. I'm saying that drm_sched_job_add_dependency() doesn't grab an extra reference to the fences it stores, so this has to be done outside. That is why
drm_sched_job_add_implicit_dependencies() calls dma_fence_get(), and if the addition fails you just call dma_fence_put() to keep the counter balanced.


drm_sched_job_add_implicit_dependencies() will call drm_sched_job_add_dependency(). And drm_sched_job_add_dependency() already calls dma_fence_put() when it fails. Calling dma_fence_put() twice doesn't make sense.

dma_fence_get() is in [2]. But dma_fence_put() will be called in [1] and [3] when xa_alloc() fails.


The way I see it, [2] and [3] are the matching *get* and *put* respectively. The [1] *put* is against the original dma_fence_init->kref_init of the fence, which always sets the refcount to 1.
Also in support of this see commit 'drm/scheduler: fix drm_sched_job_add_implicit_dependencies harder' - it says there "drm_sched_job_add_dependency() could drop the last ref"  - this last ref is the original refcount set by dma_fence_init->kref

Andrey


You can see that drm_sched_job_add_dependency() has three return paths: [4], [5] and [1]. [4] and [5] return 0. [1] returns an error.

There will be three weird problems if you're right:

1. The [5] path will trigger a refcount leak because ret is 0 at the *if* [6]. Otherwise [2] and [5] are the matching *get* and *put* here.

2. The [4] path needs an additional dma_fence_get() to add the fence as a job dependency. The fence comes from obj->resv. Taking msm as an example, obj->resv is from etnaviv_ioctl_gem_submit()->submit_lookup_objects(). It is not possible for an object to have *refcount == 1* while being referenced in two places. So the dma_fence_get() called in [2] is for [4]. By the way, [3] doesn't execute in this case.

3. This one is more of a question. See "[PATCH] drm/scheduler: fix drm_sched_job_add_implicit_dependencies harder". drm_sched_job_add_dependency() could drop the last ref, so we need to do
the dma_fence_get() first. But the last ref will still be dropped in [3] if drm_sched_job_add_dependency() takes path [1], and there is only a *return* between [1] and [3]. Is this necessary? I think Rob Clark wanted to avoid the last ref being dropped in drm_sched_job_add_implicit_dependencies() because the fence is still used by obj->resv.


int drm_sched_job_add_dependency(struct drm_sched_job *job,
                 struct dma_fence *fence)
{
    ...
    xa_for_each(&job->dependencies, index, entry) {
        if (entry->context != fence->context)
            continue;

        if (dma_fence_is_later(fence, entry)) {
            dma_fence_put(entry);
            xa_store(&job->dependencies, index, fence, GFP_KERNEL);    <---- [4]
        } else {
            dma_fence_put(fence);    <---- [5]
        }
        return 0;
    }

    ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b, GFP_KERNEL);
    if (ret != 0)
        dma_fence_put(fence);    <---- [1]

    return ret;
}


int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
                        struct drm_gem_object *obj,
                        bool write)
{
    struct dma_resv_iter cursor;
    struct dma_fence *fence;
    int ret;

    dma_resv_for_each_fence(&cursor, obj->resv, write, fence) {
        /* Make sure to grab an additional ref on the added fence */
        dma_fence_get(fence);    <---- [2]
        ret = drm_sched_job_add_dependency(job, fence);
        if (ret) {    <---- [6]
            dma_fence_put(fence);    <---- [3]

            return ret;
        }
    }
    return 0;
}
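
To make the counting in the two functions above easier to follow, here is a minimal userspace sketch of the reading argued in points 1-3. It is not kernel code: struct toy_fence, enum toy_path and every toy_* function are hypothetical stand-ins invented for illustration, only the reference count is modelled, and the starting refcount of 1 is an assumption (the single reference held through obj->resv). The labels [1] to [6] match the markers above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct dma_fence; only the reference count is modelled. */
struct toy_fence {
    int refcount; /* signed on purpose, so an over-put shows up as a negative change */
};

/* Which branch of drm_sched_job_add_dependency() the model should take. */
enum toy_path {
    TOY_STORE_NEW,  /* xa_alloc() succeeds, the job keeps the reference */
    TOY_REPLACE,    /* [4]: fence replaces an older entry               */
    TOY_DROP,       /* [5]: fence is not later, so it is put            */
    TOY_ALLOC_FAIL, /* [1]: xa_alloc() fails, fence is put              */
};

static void toy_fence_get(struct toy_fence *f) { f->refcount++; }
static void toy_fence_put(struct toy_fence *f) { f->refcount--; }

/* Models drm_sched_job_add_dependency() as quoted above. */
static int toy_add_dependency(struct toy_fence *f, enum toy_path path)
{
    switch (path) {
    case TOY_DROP:
        toy_fence_put(f);   /* [5] */
        return 0;
    case TOY_ALLOC_FAIL:
        toy_fence_put(f);   /* [1] */
        return -12;         /* stand-in for -ENOMEM */
    case TOY_STORE_NEW:
    case TOY_REPLACE:
    default:
        return 0;           /* the reference is now held by the job */
    }
}

/* Models one loop iteration of drm_sched_job_add_implicit_dependencies(). */
static int toy_add_implicit(struct toy_fence *f, enum toy_path path,
                            bool caller_puts_on_error)
{
    int ret;

    toy_fence_get(f);                   /* [2] */
    ret = toy_add_dependency(f, path);
    if (ret && caller_puts_on_error)    /* [6] */
        toy_fence_put(f);               /* [3] */
    return ret;
}

/* Net refcount change seen by the original holder of the fence. */
static int toy_net_change(enum toy_path path, bool caller_puts_on_error)
{
    struct toy_fence f = { .refcount = 1 };

    toy_add_implicit(&f, path, caller_puts_on_error);
    return f.refcount - 1;
}
```

In this model, toy_net_change(TOY_ALLOC_FAIL, true) is -1 (one put too many), while toy_net_change(TOY_ALLOC_FAIL, false) is 0. The success paths give +1 for TOY_STORE_NEW and TOY_REPLACE and 0 for TOY_DROP, regardless of the flag. Whether the real fence behaves this way depends on who owns the initial reference, which is exactly the point under discussion.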

Thanks,
hangyu




int drm_sched_job_add_dependency(struct drm_sched_job *job,
                 struct dma_fence *fence)
{
    ...
    ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b, GFP_KERNEL);
    if (ret != 0)
        dma_fence_put(fence);    <--- [1]

    return ret;
}
EXPORT_SYMBOL(drm_sched_job_add_dependency);


int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
                        struct drm_gem_object *obj,
                        bool write)
{
    struct dma_resv_iter cursor;
    struct dma_fence *fence;
    int ret;

    dma_resv_for_each_fence(&cursor, obj->resv, write, fence) {
        /* Make sure to grab an additional ref on the added fence */
        dma_fence_get(fence);    <--- [2]
        ret = drm_sched_job_add_dependency(job, fence);
        if (ret) {
            dma_fence_put(fence);    <--- [3]
            return ret;
        }
    }
    return 0;
}



On the other hand, dma_fence_get() and dma_fence_put() are meaningless here if there is an extra dma_fence_get(), because the counter will not decrease to 0 during drm_sched_job_add_dependency().

I checked the call chain as follows:

msm_ioctl_gem_submit()
-> submit_fence_sync()
-> drm_sched_job_add_implicit_dependencies()


Can you maybe trace or print one such example of the problematic refcount that you are trying to fix? I still don't see where the problem is.

Andrey


I also wish I could. System logs would make this easy, but I don't have a corresponding physical GPU device. drm_sched_job_add_implicit_dependencies() is only used on a few devices.

Thanks.


Thanks,
Hangyu


              return ret;
          }
      }