Re: [PATCH 1/3] drm/scheduler: Add more documentation

From: Asahi Lina
Date: Fri Jul 14 2023 - 05:51:56 EST


On 14/07/2023 18.47, Christian König wrote:
> On 14.07.23 at 11:39, Asahi Lina wrote:
>> On 14/07/2023 17.40, Christian König wrote:
>>> On 14.07.23 at 10:21, Asahi Lina wrote:
>>>> Document the implied lifetime rules of the scheduler (or at least the
>>>> intended ones), as well as the expectations of how resource acquisition
>>>> should be handled.
>>>>
>>>> Signed-off-by: Asahi Lina <lina@xxxxxxxxxxxxx>
>>>> ---
>>>>   drivers/gpu/drm/scheduler/sched_main.c | 58 ++++++++++++++++++++++++++++++++--
>>>>   1 file changed, 55 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 7b2bfc10c1a5..1f3bc3606239 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -43,9 +43,61 @@
>>>>    *
>>>>    * The jobs in a entity are always scheduled in the order that they were pushed.
>>>>    *
>>>> - * Note that once a job was taken from the entities queue and pushed to the
>>>> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
>>>> - * through the jobs entity pointer.
>>>> + * Lifetime rules
>>>> + * --------------
>>>> + *
>>>> + * Getting object lifetimes right across the stack is critical to avoid UAF
>>>> + * issues. The DRM scheduler has the following lifetime rules:
>>>> + *
>>>> + * - The scheduler must outlive all of its entities.
>>>> + * - Jobs pushed to the scheduler are owned by it, and must only be freed
>>>> + *   after the free_job() callback is called.
>>>> + * - Scheduler fences are reference-counted and may outlive the scheduler.
>>>> + * - The scheduler *may* be destroyed while jobs are still in flight.
>>>
>>> That's not correct. The scheduler can only be destroyed after all the
>>> entities serving it have been destroyed and all the jobs already
>>> pushed to the hw have finished.

>> The point of this series is to change this behavior so I can actually
>> use the scheduler in my use case, and that begins with formally
>> documenting it as Daniel suggested. That is, I need it to be safe for
>> jobs not to be complete yet when the scheduler is destroyed (the
>> entities do get destroyed first; that's the first bullet point).
>
> Yeah, but you need to document the current situation, not how you would
> like it to be.

Daniel told me to document how I think it should be, then fix the bugs that make it not so. That's what this series does.
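
To make the intended contract concrete, here is a rough sketch of the teardown ordering the documented rules are meant to allow. The driver structure and names (mydrv_device, mydrv_fini) are hypothetical, for illustration only; they are not part of the patch:

#include <drm/gpu_scheduler.h>

/* Hypothetical driver context, for illustration only. */
struct mydrv_device {
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
};

static void mydrv_fini(struct mydrv_device *mdev)
{
	/* Rule 1: every entity is destroyed before its scheduler. */
	drm_sched_entity_destroy(&mdev->entity);

	/*
	 * Under the rules documented by this series, jobs already
	 * pushed to the hardware may still be in flight here; the
	 * scheduler tears down safely, and each job is freed only
	 * once its free_job() callback has run. Before this series,
	 * the driver additionally had to wait for all pushed jobs to
	 * complete before making this call.
	 */
	drm_sched_fini(&mdev->sched);
}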

>> We already had this discussion. Without this guarantee, I cannot build
>> a reasonable, safe Rust abstraction. Unless you have another
>> suggestion, as far as I can tell it's either this or I give up on
>> using the DRM scheduler entirely and reimplement something else on my
>> own.
>>
>>> What might be possible to add is that the hw is still working on the
>>> already pushed jobs, but so far that was rejected as undesirable.
>>
>> Where was this rejected?
>
> Years ago. Our initial driver suspend/resume design relied on that.
> Turned out not to be a good idea.

Times change, maybe it's time to revisit that decision?
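
For what it's worth, the fence lifetime rule already points in that direction: scheduler fences are reference-counted, so a waiter that takes its own reference can keep using the fence even after the scheduler that produced it is gone. Roughly (an illustrative sketch, not code from this series; mydrv_submit_and_wait is a hypothetical name):

#include <linux/dma-fence.h>
#include <drm/gpu_scheduler.h>

/*
 * Illustrative only. Assumes the job has already been initialized
 * and armed (drm_sched_job_init() + drm_sched_job_arm()), so its
 * finished fence exists. Taking a reference before pushing the job
 * means the fence object stays valid for this wait under the rules
 * documented above, even if the scheduler is torn down in between.
 */
static void mydrv_submit_and_wait(struct drm_sched_job *job)
{
	struct dma_fence *done = dma_fence_get(&job->s_fence->finished);

	drm_sched_entity_push_job(job);

	/* The scheduler may be destroyed while we sleep here. */
	dma_fence_wait(done, false);
	dma_fence_put(done);
}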

~~ Lina