Re: [PATCH v2] drm/msm/dpu: make "vblank timeout" more useful

From: Dmitry Baryshkov
Date: Wed Feb 14 2024 - 14:20:37 EST


On Wed, 14 Feb 2024 at 20:02, Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
>
>
>
> On 2/8/2024 6:50 AM, Dmitry Baryshkov wrote:
> > We have several reports of vblank timeout messages. However after some
> > debugging it was found that there might be different causes to that.
> > To allow us to identify the DPU block that gets stuck, include the
> > actual CTL_FLUSH value into the timeout message and trigger the devcore
> > snapshot capture.
> >
> > Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@xxxxxxxxxx>
> > ---
> > Changes in v2:
> > - Added a call to msm_disp_snapshot_state() to trigger devcore dump
> > (Abhinav)
> > - Link to v1: https://lore.kernel.org/r/20240106-fd-dpu-debug-timeout-v1-1-6d9762884641@xxxxxxxxxx
> > ---
> > drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c
> > index d0f56c5c4cce..a8d6165b3c0a 100644
> > --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c
> > +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c
> > @@ -489,7 +489,8 @@ static int dpu_encoder_phys_vid_wait_for_commit_done(
> > (hw_ctl->ops.get_flush_register(hw_ctl) == 0),
> > msecs_to_jiffies(50));
> > if (ret <= 0) {
> > - DPU_ERROR("vblank timeout\n");
> > + DPU_ERROR("vblank timeout: %x\n", hw_ctl->ops.get_flush_register(hw_ctl));
> > + msm_disp_snapshot_state(phys_enc->parent->dev);
>
>
> There is no rate limiting in this piece of code unfortunately. So this
> will flood the number of snapshots.

Well... Yes and no. The devcoredump will destroy other snapshots if
there is a pending one. So only the console will be flooded and only
in case when MSM_DISP_SNAPSHOT_DUMP_IN_CONSOLE is enabled.

>
> Short-term solution is you can go with a vblank_timeout_cnt and reset it
> in the enable() like other similar error counters.
>
> long-term solution is we need to centralize these error locations to one
> single dpu_encoder_error_handler() with a single counter and the error
> handler will print out the error code along with the snapshot instead of
> the snapshot being called from all over the place.
>
>
>
> > return -ETIMEDOUT;
> > }
> >
> >
> > ---
> > base-commit: 39676dfe52331dba909c617f213fdb21015c8d10
> > change-id: 20240106-fd-dpu-debug-timeout-e917f0bc8063
> >
> > Best regards,



--
With best wishes
Dmitry