Re: [PATCH v4 2/4] drm/amdgpu: Rework coredump to use memory dynamically

From: André Almeida
Date: Thu Aug 17 2023 - 11:18:36 EST




Em 17/08/2023 12:04, Shashank Sharma escreveu:

On 17/08/2023 15:45, André Almeida wrote:
Hi Shashank,

Em 17/08/2023 03:41, Shashank Sharma escreveu:
Hello Andre,

On 15/08/2023 21:50, André Almeida wrote:
Instead of storing coredump information inside amdgpu_device struct,
move if to a proper separated struct and allocate it dynamically. This
will make it easier to further expand the logged information.

Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
---
v4: change kmalloc to kzalloc
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 14 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 63 ++++++++++++++--------
  2 files changed, 49 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 9c6a332261ab..0d560b713948 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1088,11 +1088,6 @@ struct amdgpu_device {
      uint32_t                        *reset_dump_reg_list;
      uint32_t            *reset_dump_reg_value;
      int                             num_regs;
-#ifdef CONFIG_DEV_COREDUMP
-    struct amdgpu_task_info         reset_task_info;
-    bool                            reset_vram_lost;
-    struct timespec64               reset_time;
-#endif
      bool                            scpm_enabled;
      uint32_t                        scpm_status;
@@ -1105,6 +1100,15 @@ struct amdgpu_device {
      uint32_t            aid_mask;
  };
+#ifdef CONFIG_DEV_COREDUMP
+struct amdgpu_coredump_info {
+    struct amdgpu_device        *adev;
+    struct amdgpu_task_info         reset_task_info;
+    struct timespec64               reset_time;
+    bool                            reset_vram_lost;
+};

The patch looks good to me in general, but I would recommend slightly different arrangement and segregation of GPU reset information.

Please consider a higher level structure adev->gpu_reset_info, and move everything related to reset dump info into that, including this new coredump_info structure, something like this:

struct amdgpu_reset_info {

     uint32_t *reset_dump_reg_list;

     uint32_t *reset_dump_reg_value;

     int num_regs;


Right, I can encapsulate there reset_dump members,

#ifdef CONFIG_DEV_COREDUMP

    struct amdgpu_coredump_info *coredump_info;/* keep this dynamic allocation */

but we don't need a pointer for amdgpu_coredump_info inside amdgpu_device or inside of amdgpu_device->gpu_reset_info, right?

I think it would be better if we keep all of the GPU reset related data in the same structure, so adev->gpu_reset_info->coredump_info sounds about right to me.


But after patch 2/4, we don't need to store a coredump_info pointer inside adev, this is what I meant. What would be the purpose of having this pointer? It's freed by amdgpu_devcoredump_free(), so we don't need to keep track of it.

- Shashank



#endif

}


This will make sure that all the relevant information is at the same place.

- Shashank

       amdgpu_inc_vram_lost(tmp_adev);