Re: [PATCH] drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()

From: Danilo Krummrich
Date: Tue Apr 11 2023 - 14:14:53 EST


On 4/5/23 19:39, Luben Tuikov wrote:
On 2023-03-31 01:59, Christian König wrote:
Am 31.03.23 um 02:06 schrieb Danilo Krummrich:
It already happend a few times that patches slipped through which
implemented access to an entity through a job that was already removed
from the entities queue. Since jobs and entities might have different
lifecycles, this can potentially cause UAF bugs.

In order to make it obvious that a jobs entity pointer shouldn't be
accessed after drm_sched_entity_pop_job() was called successfully, set
the jobs entity pointer to NULL once the job is removed from the entity
queue.

Moreover, debugging a potential NULL pointer dereference is way easier
than potentially corrupted memory through a UAF.

Signed-off-by: Danilo Krummrich <dakr@xxxxxxxxxx>

In general "YES PLEASE!", but I fear that this will break amdgpus reset
sequence.

On the other hand when amdgpu still relies on that pointer it's clearly
a bug (which I pointed out tons of times before).

Luben any opinion on that? Could you drive cleaning that up as well?

I didn't find any references to scheduling entity after the job
is submitted to the hardware. (I commented the same in the other
thread, we just need to decide which way to go.)

AFAICS from the other mail thread it seems to be consensus to not ref-count entities and handle job statistics differently.

Should we go ahead and take this patch then? Maybe it also makes sense to send a V2 additionally adding a comment to the drm_sched_job structure mentioning that .entity must not be used after the job was taken from the entities queue.

- Danilo


Regards,
Luben


Thanks,
Christian.

---
I'm aware that drivers could already use job->entity in arbitrary places, since
they in control of when the entity is actually freed. A quick grep didn't give
me any results where this would actually be the case, however maybe I also just
didn't catch it.

If, therefore, we don't want to set job->entity to NULL I think we should at
least add a comment somewhere.
---

drivers/gpu/drm/scheduler/sched_entity.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 15d04a0ec623..a9c6118e534b 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -448,6 +448,12 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
drm_sched_rq_update_fifo(entity, next->submit_ts);
}
+ /* Jobs and entities might have different lifecycles. Since we're
+ * removing the job from the entities queue, set the jobs entity pointer
+ * to NULL to prevent any future access of the entity through this job.
+ */
+ sched_job->entity = NULL;
+
return sched_job;
}