Re: [PATCH] Revert "drm/msm/gpu: Push gpu lock down past runpm"

From: Rob Clark
Date: Wed Jan 10 2024 - 09:55:19 EST


On Wed, Jan 10, 2024 at 2:50 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
>
> On Tue, Jan 09, 2024 at 10:22:17AM -0800, Rob Clark wrote:
> > From: Rob Clark <robdclark@xxxxxxxxxxxx>
> >
> > This reverts commit abe2023b4cea192ab266b351fd38dc9dbd846df0.
> >
> > Changing the locking order means that scheduler/msm_job_run() can race
> > with the recovery kthread worker, with the result that the GPU gets an
> > extra runpm get when we are trying to power it off. Leaving the GPU in
> > an unrecovered state.
>
> The recovery kthread is supposed to stop all the relevant schedulers,
> which should remove any possible race conditions. So unless there's more
> going on, or you have your own recovery kthread (don't, reuse the one from
> the scheduler with your own work items, that's why you can provide that)
> this looks like an incomplete/incorrect explanation ... ?
>
> Slightly confused

msm still uses it's own recovery, which pre-dates the scheduler
conversion. At one point (a yr or two back?) I started looking at
integrating recovery w/ scheduler.. at the time I think you talked me
out of it, but I don't remember the reason

BR,
-R

> -Sima
>
> >
> > I'll need to come up with a different scheme for appeasing lockdep.
> >
> > Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx>
> > ---
> > drivers/gpu/drm/msm/msm_gpu.c | 11 +++++------
> > drivers/gpu/drm/msm/msm_ringbuffer.c | 7 +++++--
> > 2 files changed, 10 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> > index 095390774f22..655002b21b0d 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.c
> > +++ b/drivers/gpu/drm/msm/msm_gpu.c
> > @@ -751,12 +751,14 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
> > struct msm_ringbuffer *ring = submit->ring;
> > unsigned long flags;
> >
> > - pm_runtime_get_sync(&gpu->pdev->dev);
> > + WARN_ON(!mutex_is_locked(&gpu->lock));
> >
> > - mutex_lock(&gpu->lock);
> > + pm_runtime_get_sync(&gpu->pdev->dev);
> >
> > msm_gpu_hw_init(gpu);
> >
> > + submit->seqno = submit->hw_fence->seqno;
> > +
> > update_sw_cntrs(gpu);
> >
> > /*
> > @@ -781,11 +783,8 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
> > gpu->funcs->submit(gpu, submit);
> > gpu->cur_ctx_seqno = submit->queue->ctx->seqno;
> >
> > - hangcheck_timer_reset(gpu);
> > -
> > - mutex_unlock(&gpu->lock);
> > -
> > pm_runtime_put(&gpu->pdev->dev);
> > + hangcheck_timer_reset(gpu);
> > }
> >
> > /*
> > diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
> > index e0ed27739449..548f5266a7d3 100644
> > --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
> > +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
> > @@ -21,8 +21,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
> >
> > msm_fence_init(submit->hw_fence, fctx);
> >
> > - submit->seqno = submit->hw_fence->seqno;
> > -
> > mutex_lock(&priv->lru.lock);
> >
> > for (i = 0; i < submit->nr_bos; i++) {
> > @@ -35,8 +33,13 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
> >
> > mutex_unlock(&priv->lru.lock);
> >
> > + /* TODO move submit path over to using a per-ring lock.. */
> > + mutex_lock(&gpu->lock);
> > +
> > msm_gpu_submit(gpu, submit);
> >
> > + mutex_unlock(&gpu->lock);
> > +
> > return dma_fence_get(submit->hw_fence);
> > }
> >
> > --
> > 2.43.0
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch