On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
<andrey.grodzovsky@xxxxxxx> wrote:
>
> On 2022-03-17 14:25, Rob Clark wrote:
>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
>> <andrey.grodzovsky@xxxxxxx> wrote:
>>>
>>> On 2022-03-17 13:35, Rob Clark wrote:
>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
>>>> <christian.koenig@xxxxxxx> wrote:
>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
>>>>>> <christian.koenig@xxxxxxx> wrote:
>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>>>>>>> [SNIP]
>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
>>>>>>>> trying to address, so we aren't racing between shutting down the hw
>>>>>>>> and the scheduler shoveling more jobs at us.
>>>>>>> Well exactly that's the problem. The scheduler is supposed to keep
>>>>>>> shoveling more jobs at us until it is empty.
>>>>>>>
>>>>>>> Thinking more about it we will then keep some dma_fence instance
>>>>>>> unsignaled and that is an extremely bad idea since it can lead to
>>>>>>> deadlocks during suspend.
>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
>>>>>> It is at least not a problem when vram is not involved.
>>>>> No, it's much wider than that.
>>>>>
>>>>> See what can happen is that the memory management shrinkers want to wait
>>>>> for a dma_fence during suspend.
>>>> we don't wait on fences in shrinker, only purging or evicting things
>>>> that are already ready. Actually, waiting on fences in shrinker path
>>>> sounds like a pretty bad idea.
>>>>
>>>>> And if you stop the scheduler they will just wait forever.
>>>>>
>>>>> What you need to do instead is to drain the scheduler, e.g. call
>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
>>>>> created.
>>>> yeah, it would work to drain the scheduler.. I guess that might be the
>>>> more portable approach as far as generic solution for suspend.
>>>>
>>>> BR,
>>>> -R
>>>
>>> I am not sure how this drains the scheduler? Suppose we have done the
>>> waiting in drm_sched_entity_flush; what prevents someone from pushing
>>> another job into the same entity's queue right after that?
>>> Shouldn't we first disable further pushing of jobs into the entity before
>>> we wait for sched->job_scheduled?
>>>
>> In the system suspend path, userspace processes will have already been
>> frozen, so there should be no way to push more jobs to the scheduler,
>> unless they are pushed from the kernel itself.
>
> amdgpu_device_suspend
>
> It was my suspicion but I wasn't sure about it.
>
>> We don't do that in
>> drm/msm, but maybe you need to move things between vram and system
>> memory?
>
> Exactly, that was my main concern - if we use this method we have to use
> it at a point in the suspend sequence when all in-kernel job submission
> activity is already suspended.
>
>> But even in that case, if the # of jobs you push is bounded I
>> guess that is ok?
>
> Submissions to scheduler entities use an unbounded queue; the bounded
> part is when you extract the next job from the entity to submit to the
> HW ring, which is refused if the submission limit has been reached
> (drm_sched_ready).
>
> In general - it looks to me at least that what we want here is
> more of a drain operation than a flush (i.e.
> we first want to disable any further job submission to the entity's queue
> and then flush all in-flight ones). As an example
> of this I was looking at flush_workqueue vs. drain_workqueue.

Would it be possible for amdgpu to, in the system suspend task,

1) first queue up all the jobs needed to migrate bos out of vram, and
whatever other housekeeping jobs are needed
2) then drain gpu scheduler's queues
3) and then finally wait for jobs executing on GPU to complete

BR,
-R

> Andrey
>
>> BR,
>> -R
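
As an illustration of the flush-versus-drain distinction and the three-step suspend ordering discussed in this thread, here is a minimal userspace toy model in Python. It is only a sketch: `ToyEntity`, `push_job`, `flush`, `drain` and `toy_suspend` are hypothetical names invented for this example, not the DRM scheduler or amdgpu API, and jobs are assumed to "complete" the moment they are dequeued.

```python
from collections import deque


class ToyEntity:
    """Hypothetical stand-in for a scheduler entity's job queue (not the DRM API)."""

    def __init__(self):
        self.queue = deque()   # unbounded software queue of pending jobs
        self.stopped = False   # set once draining has begun
        self.completed = []    # jobs that have "run" in this toy model

    def push_job(self, job):
        # Once a drain has started, new submissions are refused.
        if self.stopped:
            return False
        self.queue.append(job)
        return True

    def flush(self):
        # Flush: run whatever is queued right now. Nothing prevents a
        # later push_job() from adding more work afterwards.
        while self.queue:
            self.completed.append(self.queue.popleft())

    def drain(self):
        # Drain: close the queue to new submissions first, then flush.
        self.stopped = True
        self.flush()


def toy_suspend(entity, housekeeping_jobs):
    # 1) queue the kernel's own jobs (e.g. moving buffers out of VRAM)
    for job in housekeeping_jobs:
        entity.push_job(job)
    # 2) drain the software queue so that nothing new can arrive
    entity.drain()
    # 3) jobs complete as they are dequeued here, so an empty queue
    #    stands in for "the hardware is idle"
    return not entity.queue
```

In this model a job pushed after `flush()` is still accepted, while a job pushed after `drain()` is rejected, which is the race raised above about waiting in drm_sched_entity_flush without first closing the queue.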