Re: [PATCH] dma-buf: Move BUG_ON from _add_shared_fence to _add_shared_inplace

From: Christian KÃnig
Date: Wed Jul 04 2018 - 05:31:16 EST


Am 04.07.2018 um 11:09 schrieb Michel DÃnzer:
On 2018-07-04 10:31 AM, Christian KÃnig wrote:
Am 26.06.2018 um 16:31 schrieb Michel DÃnzer:
From: Michel DÃnzer <michel.daenzer@xxxxxxx>

Fixes the BUG_ON spuriously triggering under the following
circumstances:

* ttm_eu_reserve_buffers processes a list containing multiple BOs using
ÂÂ the same reservation object, so it calls
ÂÂ reservation_object_reserve_shared with that reservation object once
ÂÂ for each such BO.
* In reservation_object_reserve_shared, old->shared_count ==
ÂÂ old->shared_max - 1, so obj->staged is freed in preparation of an
ÂÂ in-place update.
* ttm_eu_fence_buffer_objects calls reservation_object_add_shared_fence
ÂÂ once for each of the BOs above, always with the same fence.
* The first call adds the fence in the remaining free slot, after which
ÂÂ old->shared_count == old->shared_max.
Well, the explanation here is not correct. For multiple BOs using the
same reservation object we won't call
reservation_object_add_shared_fence() multiple times because we move
those to the duplicates list in ttm_eu_reserve_buffers().

But this bug can still happen because we call
reservation_object_add_shared_fence() manually with fences for the same
context in a couple of places.

One prominent case which comes to my mind are for the VM BOs during
updates. Another possibility are VRAM BOs which need to be cleared.
Thanks. How about the following:

* ttm_eu_reserve_buffers calls reservation_object_reserve_shared.
* In reservation_object_reserve_shared, shared_count == shared_max - 1,
so obj->staged is freed in preparation of an in-place update.
* ttm_eu_fence_buffer_objects calls reservation_object_add_shared_fence,
after which shared_count == shared_max.
* The amdgpu driver also calls reservation_object_add_shared_fence for
the same reservation object, and the BUG_ON triggers.

I would rather completely drop the reference to the ttm_eu_* functions, cause those wrappers are completely unrelated to the problem.

Instead let's just note something like the following:

* When reservation_object_reserve_shared is called with shared_count == shared_max - 1,
 so obj->staged is freed in preparation of an in-place update.

* Now reservation_object_add_shared_fence is called with the first fence and after that shared_count == shared_max.

* After that reservation_object_add_shared_fence can be called with follow up fences from the same context, but since shared_count == shared_max we would run into this BUG_ON.

However, nothing bad would happen in
reservation_object_add_shared_inplace, since all fences use the same
context, so they can only occupy a single slot.

Prevent this by moving the BUG_ON to where an overflow would actually
happen (e.g. if a buggy caller didn't call
reservation_object_reserve_shared before).


Also, I'll add a reference to https://bugs.freedesktop.org/106418 in v2,
as I suspect this fix is necessary under the circumstances described
there as well.

The rest sounds good to me.

Regards,
Christian.