Re: [PATCH v4 5/6] drm/sched: Use parent fence instead of finished

From: Christian König
Date: Thu Sep 29 2022 - 10:58:13 EST


Am 29.09.22 um 16:53 schrieb Steven Price:
On 14/09/2022 17:43, Arvind Yadav wrote:
Using the parent fence instead of the finished fence
to get the job status. This change is to avoid GPU
scheduler timeout error which can cause GPU reset.
I'm able to reproduce crashes on Panfrost and I believe this commit is
the cause. Specifically it's possible for job->s_fence->parent to be NULL.

The underlying issue seems to involve drm_sched_resubmit_jobs_ext() - if
the run_jobs() callback returns an error it will set s_fence->parent to
NULL after signalling s_fence->finished:

fence = sched->ops->run_job(s_job);
i++;

if (IS_ERR_OR_NULL(fence)) {
if (IS_ERR(fence))
dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));

s_job->s_fence->parent = NULL;
I don't understand the reasoning behind this change, but it doesn't seem
right to be using the parent fence when we have code which can be
setting that pointer to NULL.

Since I don't understand the reasoning my only suggestion is to revert
this patch (and potentially the dependent patch "dma-buf: Check status
of enable-signaling bit on debug"?).

Can anyone suggest a better fix?

Well, first of all please absolutely don't use drm_sched_resubmit_jobs_ext()!

It was an extremely bad idea in amdgpu to approach GPU by re-submitting jobs and it was an even worse idea to push this into the scheduler.

The design of dma_fence is that you submit that once and *only* once and then get a result for this submission. If re-submission is desirable it should be done in userspace or at least higher levels.

Apart from that, yes a NULL check is missing here but that should be trivial to fix.

Thanks,
Christian.


Thanks,

Steve

Signed-off-by: Arvind Yadav <Arvind.Yadav@xxxxxxx>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
---

changes in v1,v2 - Enable signaling for finished fence in sche_main()
is removed

---
drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index e0ab14e0fb6b..2ac28ad11432 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -829,7 +829,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
job = list_first_entry_or_null(&sched->pending_list,
struct drm_sched_job, list);
- if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
+ if (job && dma_fence_is_signaled(job->s_fence->parent)) {
/* remove job from pending_list */
list_del_init(&job->list);
@@ -841,7 +841,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
if (next) {
next->s_fence->scheduled.timestamp =
- job->s_fence->finished.timestamp;
+ job->s_fence->parent->timestamp;
/* start TO timer for next job */
drm_sched_start_timeout(sched);
}