Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation

From: David Carrillo-Cisneros
Date: Tue Dec 27 2016 - 02:13:41 EST


>>>>> +LAZY and NOLAZY Monitoring
>>>>> +--------------------------
>>>>> +LAZY:
>>>>> +By default when monitoring is enabled, the RMIDs are not allocated
>>>>> +immediately and allocated lazily only at the first sched_in.
>>>>> +There are 2-4 RMIDs per logical processor on each package. So if a
>>>>> dual
>>>>> +package has 48 logical processors, there would be up to 192 RMIDs on
>>>>> each
>>>>> +package = total of 192x2 RMIDs.
>>>>> +There is a possibility that RMIDs can run out and in that case the read
>>>>> +reports an error since there was no RMID available to monitor for an
>>>>> +event.
>>>>> +
>>>>> +NOLAZY:
>>>>> +When user wants guaranteed monitoring, he can enable the 'monitoring
>>>>> +mask' which is basically used to specify the packages he wants to
>>>>> +monitor. The RMIDs are statically allocated at open and failure is
>>>>> +indicated if RMIDs are not available.
>>>>> +
>>>>> +To specify monitoring on package 0 and package 1:
>>>>> +#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
>>>>> +
>>>>> +An error is thrown if packages not online are specified.
>>>>
>>>>
>>>> I very much dislike both those for adding files to the perf cgroup.
>>>> Drivers should really not do that.
>>>
>>>
>>> Is the continuous monitoring the issue or the interface (adding a file in
>>> perf_cgroup)? I have not mentioned it in the documentation, but this
>>> continuous
>>> monitoring / monitoring mask applies only to cgroups in this patch, and
>>> hence
>>> we thought a good place for that is in the cgroup itself because it's per
>>> cgroup.
>>>
>>> For task events, this won't apply, and we are thinking of providing a
>>> prctl
>>> based interface for the user to toggle the continuous monitoring.
>>
>>
>> More fail..
>>
>>>>
>>>> I absolutely hate the second because events already have affinity.
>>>

The per-package NOLAZY flags are distinct from affinity: they modify
the behavior of something already running on that package. Besides,
this is intended to work when there are no perf_events, and the
perf_event cpu field is already used by cgroup events.

>>>
>>> This applies to continuous monitoring as well when there are no events
>>> associated. Meaning if the monitoring mask is chosen and the user tries to
>>> enable continuous monitoring using the cgrp->cont_mon, all RMIDs are
>>> allocated immediately. The mon_mask provides a way for the user to have
>>> guaranteed RMIDs both for cgroups that have events and for continuous
>>> monitoring (no
>>> perf event associated) (assuming the user uses it when he knows he would
>>> definitely use it; otherwise there is LAZY mode).
>>>
>>> Again, this is cgroup specific, won't apply to task events, and is needed
>>> when there are no events associated.
>>
>>
>> So no, the problem is that a driver introduces special ABI and behaviour
>> that radically departs from the regular behaviour.
>
>
> OK, looks like the interface is the problem. Will try to fix this. We are
> just trying to have a lightweight monitoring
> option so that it's reasonable to monitor for a
> very long time (like the lifetime of a process etc.). Mainly to not have all
> the perf scheduling overhead.
> Maybe a perf event attr option is a more reasonable approach for the user
> to choose the option? (rather than some new interface like prctl / cgroup
> file)

I don't see how a perf event attr option would work, since the goal of
continuous monitoring is to start CQM/CMT without a perf event.

An alternative is to add a single file to the CQM PMU directory. The
file lists the cgroups that must be continuously monitored (optionally
with per-package flags):

$ cat /sys/devices/intel_cmt/cgroup_cont_monitoring
cgroup per-pkg flags
/ 0;1;0;0
g1 0;0;0;0
g1/g1_1 0;0;0;0
g2 0:1;0;0;0

To start continuous monitoring in a cgroup (flags are optional and default
to all 0's):
$ echo "g2/g2_1 0;1;0;0" > /sys/devices/intel_cmt/cgroup_cont_monitoring
To stop it:
$ echo "-g2/g2_1" > /sys/devices/intel_cmt/cgroup_cont_monitoring

Note that the cgroup name is what perf_event_attr takes now, so it's
not that different from creating a perf event.
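
To make the proposed line format concrete, here is a rough user-space
sketch of how a write to that file could be parsed. Everything in it is
an assumption for illustration: the file does not exist yet, and
ContMonRequest and parse_cont_mon_line are hypothetical names. It also
assumes that cgroup names contain no whitespace and that the flags are
integers separated by ';'.

```python
from typing import List, NamedTuple


class ContMonRequest(NamedTuple):
    """One parsed line written to the proposed cgroup_cont_monitoring file."""
    cgroup: str           # cgroup path, e.g. "g2/g2_1" or "/"
    enable: bool          # False when the line starts with "-"
    pkg_flags: List[int]  # one flag per package; empty means "all 0's"


def parse_cont_mon_line(line: str) -> ContMonRequest:
    """Parse '<cgroup> [f0;f1;...]' (start) or '-<cgroup>' (stop)."""
    fields = line.split()
    if not fields or len(fields) > 2:
        raise ValueError(f"expected '<cgroup> [flags]', got {line!r}")

    name = fields[0]
    if name.startswith("-"):
        # A stop request carries only the cgroup name, no flags.
        if len(fields) != 1 or len(name) == 1:
            raise ValueError(f"malformed stop request: {line!r}")
        return ContMonRequest(name[1:], False, [])

    # Flags are optional; int() rejects anything that is not a plain
    # integer, so a stray ':' separator is reported as an error.
    flags = [int(f) for f in fields[1].split(";")] if len(fields) == 2 else []
    return ContMonRequest(name, True, flags)
```

With these assumptions, "g2/g2_1 0;1;0;0" parses to a start request with
four per-package flags, and "-g2/g2_1" parses to a stop request.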


Another option is to create a directory per cgroup to monitor, so:
$ mkdir /sys/devices/intel_cmt/cgroup_cont_monitoring/g1
starts continuous monitoring in g1.

This approach is problematic, though, because the cont_monitoring
property is not hierarchical, i.e. a cgroup g1/g1_1 may need
cont_monitoring while g1 doesn't. Supporting this would require
either doing something funny with the cgroup name or adding extra files
to each folder and exposing all cgroups. Neither of these options seems
good to me.
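
The hierarchy problem can be seen with a plain directory tree. The
sketch below is only a model: a temporary directory stands in for the
proposed cgroup_cont_monitoring directory, and start_cont_monitoring,
monitored_cgroups and demo are hypothetical helpers. A real sysfs
implementation could behave differently. It assumes POSIX path
separators.

```python
import os
import tempfile


def start_cont_monitoring(root: str, cgroup: str) -> None:
    """Model 'mkdir <root>/<cgroup>' as the request to monitor that cgroup."""
    os.mkdir(os.path.join(root, cgroup))


def monitored_cgroups(root: str) -> list:
    """In this model every directory under root is a monitored cgroup."""
    found = []
    for dirpath, dirnames, _ in os.walk(root):
        for d in dirnames:
            found.append(os.path.relpath(os.path.join(dirpath, d), root))
    return sorted(found)


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as root:
        # Try to monitor only the child: mkdir fails because the parent
        # directory "g1" does not exist.
        try:
            start_cont_monitoring(root, "g1/g1_1")
            child_only_ok = True
        except FileNotFoundError:
            child_only_ok = False

        # The only way to get the child is to create the parent first,
        # which in this model also marks the parent as monitored.
        start_cont_monitoring(root, "g1")
        start_cont_monitoring(root, "g1/g1_1")
        return child_only_ok, monitored_cgroups(root)
```

In this model there is no way to end up with g1/g1_1 monitored and g1
not monitored, which is exactly the case the directory layout cannot
express without extra files or name mangling.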

So, my money is on a single file
"/sys/devices/intel_cmt/cgroup_cont_monitoring". Thoughts?

Thanks,
David