Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

From: James Morse
Date: Fri Nov 11 2022 - 13:38:48 EST


Hi Reinette,

On 09/11/2022 19:12, Reinette Chatre wrote:
> On 11/9/2022 9:59 AM, James Morse wrote:
>> On 08/11/2022 21:28, Reinette Chatre wrote:
>>> On 11/3/2022 10:06 AM, James Morse wrote:
>>>> (I've not got to the last message in this part of the thread yet - I'm out of time this
>>>> week, back Monday!)
>>>>
>>>> On 21/10/2022 21:09, Reinette Chatre wrote:
>>>>> On 10/19/2022 6:57 AM, James Morse wrote:
>>>>>> On 17/10/2022 11:15, Peter Newman wrote:
>>>>>>> On Wed, Oct 12, 2022 at 6:55 PM James Morse <james.morse@xxxxxxx> wrote:
>>>
>>> ...
>>>
>>>>>>> If there are a lot more PARTIDs than PMGs, then it would fit well with a
>>>>>>> user who never creates child MON groups. In case the number of MON
>>>>>>> groups gets ahead of the number of CTRL_MON groups and you've run out of
>>>>>>> PMGs, perhaps you would just try to allocate another PARTID and program
>>>>>>> the same partitioning configuration before giving up.
>>>>>>
>>>>>> User-space can choose to do this.
>>>>>> If the kernel tries to be clever and do this behind user-space's back, it needs to
>>>>>> allocate two monitors for this secretly-two-control-groups, and always sum the counters
>>>>>> before reporting them to user-space.
>>>>
>>>>> If I understand this scenario correctly, the kernel is already doing this.
>>>>> As implemented in mon_event_count() the monitor data of a CTRL_MON group is
>>>>> the sum of the parent CTRL_MON group and all its child MON groups.
>>>>
>>>> That is true. MPAM has an additional headache here as it needs to allocate a monitor in
>>>> order to read the counters. If there are enough monitors for each CLOSID*RMID to have one,
>>>> then MPAM can export the counter files in the same way RDT does.
>>>>
>>>> While there are systems that have enough monitors, I don't think this is going to be the
>>>> norm. To allow systems that don't have a surfeit of monitors to use the counters, I plan
>>>> to export the values from resctrl_arch_rmid_read() via perf. (but only for bandwidth counters)
>>
>>> This sounds related to the way monitoring was done in earlier kernels. This was
>>> long before I became involved with this work. Unfortunately I am not familiar with
>>> all the history involved that ended in it being removed from the kernel.
>>
>> Yup, I'm aware there is some history to this. It's not appropriate for the llc_occupancy
>> counter as that reports state, instead of events.

> Perf counts events while a process is running

It's hooked up as an uncore PMU driver, and it rejects attempts to attach it to a task.
Some useful background: it has to be told which of the existing resctrl control/monitor
groups to monitor. On x86 it's just returning the increase in events from the mbm files
in resctrl via resctrl_arch_rmid_read().
Unless you're curious [0], the details can come if/when I post it!
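
To illustrate what "returning the increase in events" means, here is a minimal
stand-alone sketch. It is not the code from [0]: struct mbm_event_state and
mbm_event_delta() are hypothetical names made up for this example, and the raw
value is assumed to be a free-running 64-bit byte count of the kind read from
the mbm files.

```c
#include <stdint.h>

/* Hypothetical per-event state: the last raw counter value that was seen. */
struct mbm_event_state {
	uint64_t prev;	/* raw value at the previous read */
	int valid;	/* 0 until the first read has been taken */
};

/*
 * Hypothetical helper: report how far the counter advanced since the
 * previous read. The first read only establishes a baseline and reports 0.
 * Unsigned subtraction is modulo 2^64, so a single wrap of the raw counter
 * between two reads still gives the right increase.
 */
static uint64_t mbm_event_delta(struct mbm_event_state *st, uint64_t now)
{
	uint64_t delta = 0;

	if (st->valid)
		delta = now - st->prev;

	st->prev = now;
	st->valid = 1;

	return delta;
}
```

This only works for counters that report events. llc_occupancy reports state,
so a delta between two reads of it isn't meaningful.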


> so memory bandwidth monitoring may
> also be impacted by the caveats Peter mentioned for the upcoming AMD changes:
>
> https://lore.kernel.org/lkml/CALPaoCidd+WwGTyE3D74LhoL13ce+EvdTmOnyPrQN62j+zZ1fg@xxxxxxxxxxxxxx/
> ("This has the caveats that evictions while one task is running could have
> resulted from a previous task on the current CPU, but will be counted
> against the new task's software-RMID, ...")

If the logic to implement that is hidden entirely behind resctrl_arch_rmid_read(), then
there should be no problem. (the values will be noisy, but that is the best that can be
done on that platform)


Thanks,

James

[0] Beware, the changes to x86 to make resctrl_arch_rmid_read() irq safe aren't quite right.
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/commit/?h=mpam/snapshot/v6.0&id=b8ae575bd17e1d56db0f84dc456b964a23d252d6