On 14/09/2022 17:43, Arvind Yadav wrote:
Using the parent fence instead of the finished fenceI'm able to reproduce crashes on Panfrost and I believe this commit is
to get the job status. This change is to avoid GPU
scheduler timeout error which can cause GPU reset.
the cause. Specifically it's possible for job->s_fence->parent to be NULL.
The underlying issue seems to involve drm_sched_resubmit_jobs_ext() - if
the run_jobs() callback returns an error it will set s_fence->parent to
NULL after signalling s_fence->finished:
fence = sched->ops->run_job(s_job);I don't understand the reasoning behind this change, but it doesn't seem
i++;
if (IS_ERR_OR_NULL(fence)) {
if (IS_ERR(fence))
dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
s_job->s_fence->parent = NULL;
right to be using the parent fence when we have code which can be
setting that pointer to NULL.
Since I don't understand the reasoning my only suggestion is to revert
this patch (and potentially the dependent patch "dma-buf: Check status
of enable-signaling bit on debug"?).
Can anyone suggest a better fix?
Thanks,
Steve
Signed-off-by: Arvind Yadav <Arvind.Yadav@xxxxxxx>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
---
changes in v1,v2 - Enable signaling for finished fence in sche_main()
is removed
---
drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index e0ab14e0fb6b..2ac28ad11432 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -829,7 +829,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
job = list_first_entry_or_null(&sched->pending_list,
struct drm_sched_job, list);
- if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
+ if (job && dma_fence_is_signaled(job->s_fence->parent)) {
/* remove job from pending_list */
list_del_init(&job->list);
@@ -841,7 +841,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
if (next) {
next->s_fence->scheduled.timestamp =
- job->s_fence->finished.timestamp;
+ job->s_fence->parent->timestamp;
/* start TO timer for next job */
drm_sched_start_timeout(sched);
}