[PATCH 3/8] habanalabs: print context refcount value if hard reset fails

From: Oded Gabbay
Date: Wed Nov 23 2022 - 09:58:32 EST


From: Tomer Tayar <ttayar@xxxxxxxxx>

Failing to kill a user process during a hard reset can be due to a
reference to the user context which isn't released.
To make it easier to understand if this is the reason for the failure and
not something else, add a print of the context refcount value.

Signed-off-by: Tomer Tayar <ttayar@xxxxxxxxx>
Reviewed-by: Oded Gabbay <ogabbay@xxxxxxxxxx>
Signed-off-by: Oded Gabbay <ogabbay@xxxxxxxxxx>
---
drivers/misc/habanalabs/common/device.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
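
Note on the "- 1" in the new print: the reference taken just to read the
counter safely is itself included in the value read, so it has to be
subtracted to report only the references held by others. Below is a minimal
userspace sketch of that idea. All mock_* names and outstanding_refs() are
hypothetical stand-ins and not driver code; the counter is a plain integer
rather than an atomic kref, so the sketch is only valid single-threaded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct kref. */
struct mock_kref {
	unsigned int count;
};

/* Hypothetical stand-in for a context object that embeds a refcount. */
struct mock_ctx {
	struct mock_kref refcount;
};

/*
 * Take a temporary reference, the way a "get" helper would.
 * Returns NULL if there is no context or it is already at zero.
 */
static struct mock_ctx *mock_get_compute_ctx(struct mock_ctx *ctx)
{
	if (!ctx || ctx->refcount.count == 0)
		return NULL;

	ctx->refcount.count++;
	return ctx;
}

/* Drop a reference previously taken with mock_get_compute_ctx(). */
static void mock_ctx_put(struct mock_ctx *ctx)
{
	assert(ctx->refcount.count > 0);
	ctx->refcount.count--;
}

/*
 * Report how many references are held by others.
 * The temporary reference taken here is excluded by subtracting one.
 * Returns -1 if no context is available.
 */
static int outstanding_refs(struct mock_ctx *maybe_ctx)
{
	struct mock_ctx *ctx = mock_get_compute_ctx(maybe_ctx);
	int others;

	if (!ctx)
		return -1;

	others = (int)ctx->refcount.count - 1;
	mock_ctx_put(ctx);

	return others;
}
```

With a context whose counter is 2, outstanding_refs() reports 2 and leaves
the counter unchanged, because the temporary reference is both excluded from
the result and released before returning.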

diff --git a/drivers/misc/habanalabs/common/device.c b/drivers/misc/habanalabs/common/device.c
index f5864893237c..926f230def56 100644
--- a/drivers/misc/habanalabs/common/device.c
+++ b/drivers/misc/habanalabs/common/device.c
@@ -696,10 +696,22 @@ static void device_hard_reset_pending(struct work_struct *work)
flags = device_reset_work->flags | HL_DRV_RESET_FROM_RESET_THR;

rc = hl_device_reset(hdev, flags);
+
if ((rc == -EBUSY) && !hdev->device_fini_pending) {
- dev_info(hdev->dev,
- "Could not reset device. will try again in %u seconds",
- HL_PENDING_RESET_PER_SEC);
+ struct hl_ctx *ctx = hl_get_compute_ctx(hdev);
+
+ if (ctx) {
+ /* The read refcount value should be decremented by one, because the read is
+ * protected with hl_get_compute_ctx(), which itself takes a reference.
+ */
+ dev_info(hdev->dev,
+ "Could not reset device (compute_ctx refcount %u). will try again in %u seconds",
+ kref_read(&ctx->refcount) - 1, HL_PENDING_RESET_PER_SEC);
+ hl_ctx_put(ctx);
+ } else {
+ dev_info(hdev->dev, "Could not reset device. will try again in %u seconds",
+ HL_PENDING_RESET_PER_SEC);
+ }

queue_delayed_work(hdev->reset_wq, &device_reset_work->reset_work,
msecs_to_jiffies(HL_PENDING_RESET_PER_SEC * 1000));
--
2.25.1