Re: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type

From: André Almeida
Date: Mon Apr 24 2023 - 09:27:43 EST


Hi Christian, thank you for your comments.

On 24/04/2023 04:03, Christian König wrote:
> On 24.04.23 at 03:43, André Almeida wrote:
>> When a DRM job times out, the GPU is probably hung, and amdgpu has several
>> ways to deal with that, ranging from soft recoveries to a full device
>> reset. In either case, when userspace asks the kernel for the state of the
>> context (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device
>> was reset, regardless of whether a full reset happened or not.
>>
>> However, amdgpu only marks a context guilty in the ASIC reset path. This
>> makes the userspace report incomplete, given that on the soft recovery path
>> the guilty context is not told that it is the guilty one.
>>
>> Fix this by marking the context guilty for every type of reset when a
>> job times out.
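
To make the behavior change in the commit message concrete, here is a minimal userspace sketch. It is not the amdgpu code: `recovery_type`, `ctx_model`, `on_timeout_current` and `on_timeout_proposed` are hypothetical names invented for illustration, and the model assumes only what the commit message states (a reset is reported in both cases, but the guilty mark is set only on the ASIC reset path).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical recovery types; not the real amdgpu definitions. */
enum recovery_type {
	RECOVERY_SOFT,       /* soft recovery, no ASIC reset */
	RECOVERY_FULL_RESET, /* full device (ASIC) reset */
};

/* Hypothetical stand-in for the per-context state userspace can query. */
struct ctx_model {
	bool reset_reported; /* the query says "the device was reset" */
	bool guilty;         /* the query says "this context caused it" */
};

/* Behavior as described before the patch: guilty only on a full reset. */
static inline void on_timeout_current(struct ctx_model *ctx,
				      enum recovery_type type)
{
	assert(ctx != NULL);
	ctx->reset_reported = true;
	if (type == RECOVERY_FULL_RESET)
		ctx->guilty = true;
}

/* Behavior proposed by the patch: guilty for every type of recovery. */
static inline void on_timeout_proposed(struct ctx_model *ctx,
				       enum recovery_type type)
{
	assert(ctx != NULL);
	(void)type; /* the recovery type no longer affects the guilty mark */
	ctx->reset_reported = true;
	ctx->guilty = true;
}
```

In this model, a soft recovery under the current behavior leaves the context with a reported reset but no guilty mark, which is the incomplete report the commit message describes.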

> The guilty handling is pretty much broken by design and only works because we go through multiple hops of validating the entity after the job has already been pushed to the hw.

I see, thanks.


> I think we should probably just remove that completely and use an approach where we check the in-flight submissions in the query state IOCTL.

Like the DRM_IOCTL_I915_GET_RESET_STATS approach?
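
For reference, one way to read "check the in-flight submissions at query time" is sketched below. This is only an assumption about the shape of such an approach, not Christian's patch and not the i915 implementation: `submission_model`, `query_result` and `query_state` are hypothetical names, and the per-submission `error` field stands in for whatever error the fence of a submission would carry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result of a query-state call. */
enum query_result {
	CTX_OK,           /* no submission of this context failed */
	CTX_GUILTY_RESET, /* at least one submission carries an error */
};

/* Hypothetical record for one in-flight submission of a context. */
struct submission_model {
	int error; /* 0 on success, negative errno-style value on failure */
};

/*
 * Derive the context state at query time from its submissions, instead
 * of relying on a guilty flag set somewhere in the reset path.
 */
static inline enum query_result
query_state(const struct submission_model *subs, size_t count)
{
	assert(subs != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (subs[i].error != 0)
			return CTX_GUILTY_RESET;
	}
	return CTX_OK;
}
```

The point of the sketch is that the answer is computed from the submissions themselves, so it does not depend on which recovery path handled the timeout.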

> See my other patch on the mailing list regarding that.

Which one, the "[PATCH 1/8] drm/scheduler: properly forward fence errors" series?


> In addition to that, I currently don't consider soft-recovered submissions as fatal and continue accepting submissions from that context, but I already wanted to talk with Marek about that behavior.


Interesting. I will try to test and validate this approach to see if the contexts keep working as expected on soft-resets.

> Regards,
> Christian.