Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

From: Christian König
Date: Fri Jul 14 2023 - 06:18:18 EST


Am 14.07.23 um 12:06 schrieb Asahi Lina:
On 14/07/2023 18.57, Christian König wrote:
Am 14.07.23 um 11:49 schrieb Asahi Lina:
On 14/07/2023 17.43, Christian König wrote:
Am 14.07.23 um 10:21 schrieb Asahi Lina:
A signaled scheduler fence can outlive its scheduler, since fences are
independently reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina <lina@xxxxxxxxxxxxx>
---
    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
    include/drm/gpu_scheduler.h              | 5 +++++
    3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
 		/*
 		 * Fence is from the same scheduler, only need to wait for
-		 * it to be scheduled
+		 * it to be scheduled.
+		 *
+		 * Note: s_fence->sched could have been freed and reallocated
+		 * as another scheduler. This false positive case is okay, as if
+		 * the old scheduler was freed all of its jobs must have
+		 * signaled their completion fences.
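[Editor's note: the lifetime idea in this patch can be sketched outside the kernel. The toy model below uses hypothetical names (toy_sched, toy_fence), not the DRM API: the fence caches the timeline name at creation time, so get_timeline_name() never dereferences a scheduler that may already have been freed.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins; names are hypothetical, not the DRM API. */
struct toy_sched {
	char name[32];
};

struct toy_fence {
	int refcount;
	struct toy_sched *sched; /* may dangle once the scheduler is freed */
	char timeline[32];       /* cached copy, safe to read afterwards */
};

static struct toy_fence *toy_fence_create(struct toy_sched *sched)
{
	struct toy_fence *f = calloc(1, sizeof(*f));

	f->refcount = 1;
	f->sched = sched;
	/* Copy the name now: the scheduler may not outlive the fence. */
	strncpy(f->timeline, sched->name, sizeof(f->timeline) - 1);
	return f;
}

/* Never touches f->sched, so it works even after the scheduler died. */
static const char *toy_fence_timeline_name(const struct toy_fence *f)
{
	return f->timeline;
}
```

With this pattern, something like `cat /sys/kernel/debug/dma_buf/bufinfo` reading the timeline name of a long-lived fence no longer reaches into freed scheduler memory.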

This is outright nonsense. As long as an entity for a scheduler exists,
it is not allowed to free up this scheduler.

So this function can't be called like this.

As I already explained, the fences can outlive their scheduler. That
means *this* entity certainly exists for *this* scheduler, but the
*dependency* fence might have come from a past scheduler that was
already destroyed along with all of its entities, and its address reused.
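[Editor's note: the false-positive scenario can be made concrete with a toy model (hypothetical names, not kernel code). The point is that a stale same-scheduler pointer match on the fast path only skips waiting for the completion fence, which is harmless exactly when all of the freed scheduler's jobs had signaled before teardown.]

```c
#include <assert.h>
#include <stdbool.h>

struct toy_sched { int id; };

struct toy_dep_fence {
	const struct toy_sched *sched; /* address may have been reused */
	bool signaled;
};

/*
 * Models the fast path: a same-scheduler dependency only waits for the
 * "scheduled" fence and skips the completion fence. A stale pointer
 * match is harmless precisely because a scheduler is only freed after
 * all of its jobs have signaled their completion fences.
 */
static bool fast_path_is_safe(const struct toy_dep_fence *dep,
			      const struct toy_sched *mine,
			      bool dep_sched_was_freed)
{
	if (dep->sched != mine)
		return true;      /* slow path: full wait, always safe */
	if (!dep_sched_was_freed)
		return true;      /* genuine same-scheduler dependency */
	return dep->signaled;     /* false positive: safe iff signaled */
}
```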

Well, this function is not about fences; it is a callback for the entity.

That deals with dependency fences, which could have come from any arbitrary source, including another entity and another scheduler.

No, they can't. Signaling is mandatory before things are released, even if we allow decoupling the dma_fence from its issuer.



Christian, I'm really getting tired of your tone. I don't appreciate
being told my comments are "outright nonsense" when you don't even
take the time to understand what the issue is and what I'm trying to
do/document. If you aren't interested in working with me, I'm just
going to give up on drm_sched, wait until Rust gets workqueue support,
and reimplement it in Rust. You can keep your broken fence lifetime
semantics and I'll do my own thing.

I'm certainly trying to help here, but you seem to have unrealistic
expectations.

I don't think expecting not to be told my changes are "outright nonsense" is an unrealistic expectation. If you think it is, maybe that's yet another indicator of the culture problems the kernel community has...

Well, I'm just pointing out that you don't seem to understand the background of these things and think this is a bug rather than intentional behavior.


I perfectly understand what you are trying to do, but you don't seem to
understand that this functionality here isn't made for your use case.

I do, that's why I'm trying to change things. Right now, this functionality isn't even properly documented, which is why I thought it could be used for my use case, and slowly discovered otherwise. Daniel suggested documenting it, then fixing the mismatches between documentation and reality, which is what I'm doing here.

Well, I have known Daniel for something like 10-15 years, so I'm pretty sure he meant that you should document the existing state, because doing otherwise goes against the usual patch submission approach.


We can adjust the functionality to better match your requirements, but
you can't say it is broken because it doesn't work when you use it in a
way it wasn't intended to be used.

I'm saying the idea that a random dma-buf holds onto a chain of references that prevents unloading a driver module that wrote into it (and keeps a bunch of random unrelated objects alive) is a broken state of affairs.

Well no, this is intentional design. Otherwise the module, and with it the operations pointer the fences rely on, would go away. We already discussed that over 10 years ago when Maarten came up with the initial dma_fence design.

The resulting problems are very well known and I completely agree that they are undesirable, but this is how the framework works and not just the scheduler but the rest of the DMA-buf framework as well.
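[Editor's note: the module-lifetime argument can also be sketched with toy refcounting (hypothetical names, not the real dma_fence/module API): a fence's ops table lives in the driver module, so any holder of the fence transitively pins the module until the fence itself is released.]

```c
#include <assert.h>
#include <stdlib.h>

struct toy_fence;

/* The vtable lives in the driver module's .text/.rodata. */
struct toy_fence_ops {
	const char *(*get_timeline_name)(struct toy_fence *f);
};

struct toy_module {
	int refcount; /* "loaded" while > 0 */
};

struct toy_fence {
	int refcount;
	struct toy_module *owner;        /* pins the providing module */
	const struct toy_fence_ops *ops; /* would dangle if owner unloaded */
};

static void toy_module_put(struct toy_module *m)
{
	m->refcount--;
}

static struct toy_fence *toy_fence_create(struct toy_module *m,
					  const struct toy_fence_ops *ops)
{
	struct toy_fence *f = calloc(1, sizeof(*f));

	f->refcount = 1;
	m->refcount++; /* the fence takes its own module reference */
	f->owner = m;
	f->ops = ops;
	return f;
}

static void toy_fence_put(struct toy_fence *f)
{
	if (--f->refcount == 0) {
		toy_module_put(f->owner); /* last unpin happens here */
		free(f);
	}
}
```

This is exactly the chain the discussion is about: a shared dma-buf holding such a fence keeps the module reference alive indefinitely.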

It may or may not trickle down to actual problems for users (I would bet it does in some cases but I don't know for sure), but it's a situation that doesn't make any sense.

I know I'm triggering actual breakage with my new use case due to this, which is why I'm trying to improve things. But the current state of affairs just doesn't make any sense even if it isn't causing kernel oopses today with other drivers.

You can go ahead and try to re-implement the functionality in Rust, but
then I would reject it, pointing out that this should probably be an
extension of the existing code.

You keep rejecting my attempts at extending the existing code...

Well, I will try to improve here and push you in the right direction instead.

Regards,
Christian.


~~ Lina