Re: [PATCH 15/17] cgroup/drm: Expose GPU utilisation

From: Tejun Heo
Date: Tue Jul 25 2023 - 17:44:19 EST


Hello,

On Tue, Jul 25, 2023 at 03:08:40PM +0100, Tvrtko Ursulin wrote:
> > Also, shouldn't this be keyed by the drm device?
>
> It could have that too, or it could come later. The fun with GPUs is that
> it could be keyed not only by the device, but also by the type of GPU
> engine. (Engines are a) vendor specific and b) some are fully independent,
> some partially so, and some not at all - so the semantics could get
> complicated really fast.)

I see.

> If for now I go with drm.stat/usage_usec containing the total time spent,
> how would you suggest adding per-device granularity? Files as documented
> are either flat or nested, not both at the same time. So something like:
>
> usage_usec 100000
> card0 usage_usec 50000
> card1 usage_usec 50000
>
> Would that fly or not? Or should there be two files, along the lines of
> drm.stat and drm.dev_stat?

Please follow one of the pre-defined formats. If you want to have card
identifier and field key, it should be a nested keyed file like io.stat.
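
As a rough illustration of what that means for userspace, here is a minimal
Python sketch that parses a nested keyed file of the io.stat kind (one line
per key, followed by space-separated subkey=value pairs). The drm.stat
contents and the parse_nested_keyed() helper are assumptions made up for the
example, not something the patch defines.

```python
def parse_nested_keyed(text):
    """Parse cgroup v2 nested keyed text into {key: {subkey: int}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key, pairs = fields[0], fields[1:]
        entry = {}
        for pair in pairs:
            # Each pair has the form subkey=value.
            name, _, value = pair.partition("=")
            entry[name] = int(value)
        result[key] = entry
    return result


# Hypothetical drm.stat contents: per-device usage in nested keyed form.
sample = "card0 usage_usec=50000\ncard1 usage_usec=50000\n"
stats = parse_nested_keyed(sample)

# The cgroup-wide total can be derived by summing over the devices.
total = sum(dev["usage_usec"] for dev in stats.values())
print(stats, total)
```

With every line following the same key-plus-pairs shape, a generic parser
like this works without special-casing an unkeyed total line.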

> While on this general topic, you will notice that for memory stats I have
> a _sort of_ nested keyed per-device format, for example on an integrated
> Intel GPU:
>
> $ cat drm.memory.stat
> card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
> card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>
> On a discrete Intel GPU two more lines would appear, for the local and
> local-system memory regions. Some server-class multi-tile GPUs would add
> even more regions, since they have more than one device-local memory
> region. And users do want to see this granularity, for container use
> cases at least.
>
> Anyway, this may not be compatible with the nested key format as
> documented in cgroup-v2.rst, although the document does not explicitly
> say so.
>
> Should I cheat and create key names based on device and memory region name
> and let userspace parse it? Like:
>
> $ cat drm.memory.stat
> card0.system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
> card0.stolen-system total=0 shared=0 active=0 resident=0 purgeable=0

Yeah, this looks better to me. If they're reporting different values for the
same fields, they're separate keys.
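
For what it's worth, a userspace consumer of such composite keys could look
roughly like the Python sketch below. It assumes the device name itself
never contains a dot, so the key can be split on the first one; the
parse_memory_stat() helper is hypothetical and the sample lines are the ones
quoted above.

```python
def parse_memory_stat(text):
    """Parse 'device.region k=v ...' lines into {(device, region): {k: int}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Assumption: device names have no dots, so split on the first one.
        device, _, region = fields[0].partition(".")
        values = {}
        for pair in fields[1:]:
            name, _, value = pair.partition("=")
            values[name] = int(value)
        stats[(device, region)] = values
    return stats


sample = (
    "card0.system total=12898304 shared=0 active=0 "
    "resident=12111872 purgeable=167936\n"
    "card0.stolen-system total=0 shared=0 active=0 resident=0 purgeable=0\n"
)
stats = parse_memory_stat(sample)

# Resident bytes per device, summed over all of its memory regions.
resident = {}
for (device, _region), values in stats.items():
    resident[device] = resident.get(device, 0) + values["resident"]
print(resident)
```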

Thanks.

--
tejun