Re: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring

From: Tony Luck
Date: Mon Aug 28 2023 - 18:14:01 EST


On Mon, Aug 28, 2023 at 02:20:04PM -0700, Reinette Chatre wrote:
>
>
> On 8/28/2023 1:59 PM, Tony Luck wrote:
> > On Mon, Aug 28, 2023 at 12:56:27PM -0700, Reinette Chatre wrote:
> >> Hi Tony,
> >>
> >> On 8/28/2023 11:46 AM, Tony Luck wrote:
> >>> On Mon, Aug 28, 2023 at 10:05:31AM -0700, Reinette Chatre wrote:
> >>>> Hi Tony,
> >>>>
> >>>> On 8/25/2023 10:05 AM, Tony Luck wrote:
> >>>>> On Fri, Aug 11, 2023 at 10:29:25AM -0700, Reinette Chatre wrote:
> >>>>>> On 7/22/2023 12:07 PM, Tony Luck wrote:
> >>>>
> >>>>>>> Change all places where monitoring functions walk the list of
> >>>>>>> domains to use the new "mondomains" list instead of the old
> >>>>>>> "domains" list.
> >>>>>>
> >>>>>> I would not refer to it as "the old domains list" as it creates
> >>>>>> impression that this is being replaced. The changelog makes
> >>>>>> no mention that domains list will remain and be dedicated to
> >>>>>> control domains. I think this is important to include in description
> >>>>>> of this change.
> >>>>>
> >>>>> I've rewritten the entire commit message incorporating your suggestions.
> >>>>> V6 will be posted soon (after I get some time on an SNC SPR to check
> >>>>> that it all works!)
> >>>>
> >>>> I seem to have missed v5.
> >>>
> >>> I simply can't count. You are correct that next version to be posted
> >>> will be v5.
> >>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
> >>>>>>> ---
> >>>>>>> include/linux/resctrl.h | 10 +-
> >>>>>>> arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
> >>>>>>> arch/x86/kernel/cpu/resctrl/core.c | 195 +++++++++++++++-------
> >>>>>>> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +-
> >>>>>>> arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
> >>>>>>> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 30 ++--
> >>>>>>> 6 files changed, 167 insertions(+), 74 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> >>>>>>> index 8334eeacfec5..1267d56f9e76 100644
> >>>>>>> --- a/include/linux/resctrl.h
> >>>>>>> +++ b/include/linux/resctrl.h
> >>>>>>> @@ -151,9 +151,11 @@ struct resctrl_schema;
> >>>>>>> * @mon_capable: Is monitor feature available on this machine
> >>>>>>> * @num_rmid: Number of RMIDs available
> >>>>>>> * @cache_level: Which cache level defines scope of this resource
> >>>>>>> + * @mon_scope: Scope of this resource if different from cache_level
> >>>>>>
> >>>>>> I think this addition should be deferred. As it is here it the "if different
> >>>>>> from cache_level" also creates many questions (when will it be different?
> >>>>>> how will it be determined that the scope is different in order to know that
> >>>>>> mon_scope should be used?)
> >>>>>
> >>>>> I've gone in a different direction. V6 renames "cache_level" to
> >>>>> "ctrl_scope". I think this makes intent clear from step #1.
> >>>>>
> >>>>
> >>>> This change is not clear to me. Previously you changed this name
> >>>> but kept using it in code specific to cache levels. It is not clear
> >>>> to me how this time's rename would be different.
> >>>
> >>> The current "cache_level" field in the structure is describing
> >>> the scope of each instance using the cache level (2 or 3) as the
> >>> method to describe which CPUs are considered part of a group. Currently
> >>> the scope is the same for both control and monitor resources.
> >>
> >> Right.
> >>
> >>>
> >>> Would you like to see patches in this progrssion:
> >>>
> >>> 1) Rename "cache_level" to "scope". With commit comment that future
> >>> patches are going to base the scope on NUMA nodes in addtion to sharing
> >>> caches at particular levels, and will split into separate control and
> >>> monitor scope.
> >>>
> >>> 2) Split the "scope" field from first patch into "ctrl_scope" and
> >>> "mon_scope" (also with the addition of the new list for the mon_scope).
> >>>
> >>> 3) Add "node" as a new option for scope in addtion to L3 and L2 cache.
> >>>
> >>
> >> hmmm - my comment cannot be addressed through patch re-ordering.
> >> If I understand correctly you plan to change the name of "cache_level"
> >> to "ctrl_scope". My comment is that this obfuscates the code as long as
> >> you use this variable to compare against data that can only represent cache
> >> levels. This just repeats what I wrote in
> >> https://lore.kernel.org/lkml/09847c37-66d7-c286-a313-308eaa338c64@xxxxxxxxx/
> >
> > I'm proposing more than just re-ordering. The above sequence is a
> > couple of extra patches in the series.
> >
> > Existing state of code:
> >
> > There is a single field named "cache_level" that describes how
> > CPUs are assigned to domains based on their sharing of a cache
> > at a particular level. Hard-coded values of "2" and "3" are used
> > to describe the level. This is just a scope description of which
> > CPUs are grouped together. But it is limited to just doing so
> > based on which caches are shared by those CPUs.
> >
> > Step 1:
> >
> > Change the name of the field s/cache_level/scope/. Provide an
> > enum with values RESCTRL_L3_CACHE, RESCTRL_L2_CACHE aand use
> > those througout code instead of 3, 2, and implictly passing
> > the resctrl scope to functions like get_cpu_cacheinfo_id()
> >
> > Add get_domain_id_from_scope() function that takes the enum
> > values for scope and converts to "3", "2" to pass to
> > get_cpu_cacheinfo_id().
> >
> > No functional change. Just broadening the meaning of the field
> > so that it can in future patches to describe scopes that
> > aren't a function of sharing a cache of a particular level.
>
> Right. So if I understand you are planning the same change as you did
> in V3, like below, and my original comment still stands.
>
> @@ -1348,7 +1348,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> num_b = bitmap_weight(&cbm, r->cache.cbm_len);
> ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
> for (i = 0; i < ci->num_leaves; i++) {
> - if (ci->info_list[i].level == r->cache_level) {
> + if (ci->info_list[i].level == r->scope) {
> size = ci->info_list[i].size / r->cache.cbm_len * num_b;
> break;

No. I heard your complaint about mixing "scope" and "cache_level".
Latest version of that code looks like this:

1341 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
1342 struct rdt_domain *d, unsigned long cbm)
1343 {
1344 struct cpu_cacheinfo *ci;
1345 unsigned int size = 0;
1346 int cache_level;
1347 int num_b, i;
1348
1349 switch (r->ctrl_scope) {
1350 case RESCTRL_L3_CACHE:
1351 cache_level = 3;
1352 break;
1353 case RESCTRL_L2_CACHE:
1354 cache_level = 2;
1355 break;
1356 default:
1357 WARN_ON_ONCE(1);
1358 return size;
1359 }
1360
1361 num_b = bitmap_weight(&cbm, r->cache.cbm_len);
1362 ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
1363 for (i = 0; i < ci->num_leaves; i++) {
1364 if (ci->info_list[i].level == cache_level) {
1365 size = ci->info_list[i].size / r->cache.cbm_len * num_b;
1366 break;
1367 }
1368 }
1369
1370 return size / snc_nodes_per_l3_cache;
1371 }

> }
>
>
> >>>>>>> @@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> >>>>>>> */
> >>>>>>> static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >>>>>>> {
> >>>>>>> - int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> >>>>>>> - struct list_head *add_pos = NULL;
> >>>>>>> - struct rdt_hw_domain *hw_dom;
> >>>>>>> - struct rdt_domain *d;
> >>>>>>> - int err;
> >>>>>>> -
> >>>>>>> - d = rdt_find_domain(r, id, &add_pos);
> >>>>>>> - if (IS_ERR(d)) {
> >>>>>>> - pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> >>>>>>> - return;
> >>>>>>> - }
> >>>>>>> -
> >>>>>>> - if (d) {
> >>>>>>> - cpumask_set_cpu(cpu, &d->cpu_mask);
> >>>>>>> - if (r->cache.arch_has_per_cpu_cfg)
> >>>>>>> - rdt_domain_reconfigure_cdp(r);
> >>>>>>> - return;
> >>>>>>> - }
> >>>>>>> -
> >>>>>>> - hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> >>>>>>> - if (!hw_dom)
> >>>>>>> - return;
> >>>>>>> -
> >>>>>>> - d = &hw_dom->d_resctrl;
> >>>>>>> - d->id = id;
> >>>>>>> - cpumask_set_cpu(cpu, &d->cpu_mask);
> >>>>>>> -
> >>>>>>> - rdt_domain_reconfigure_cdp(r);
> >>>>>>> -
> >>>>>>> - if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> >>>>>>> - domain_free(hw_dom);
> >>>>>>> - return;
> >>>>>>> - }
> >>>>>>> -
> >>>>>>> - if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> >>>>>>> - domain_free(hw_dom);
> >>>>>>> - return;
> >>>>>>> - }
> >>>>>>> -
> >>>>>>> - list_add_tail(&d->list, add_pos);
> >>>>>>> -
> >>>>>>> - err = resctrl_online_domain(r, d);
> >>>>>>> - if (err) {
> >>>>>>> - list_del(&d->list);
> >>>>>>> - domain_free(hw_dom);
> >>>>>>> - }
> >>>>>>> + if (r->alloc_capable)
> >>>>>>> + domain_add_cpu_ctrl(cpu, r);
> >>>>>>> + if (r->mon_capable)
> >>>>>>> + domain_add_cpu_mon(cpu, r);
> >>>>>>> }
> >>>>>>
> >>>>>> A resource could be both alloc and mon capable ... both
> >>>>>> domain_add_cpu_ctrl() and domain_add_cpu_mon() can fail.
> >>>>>> Should domain_add_cpu_mon() still be run for a CPU if
> >>>>>> domain_add_cpu_ctrl() failed?
> >>>>>>
> >>>>>> Looking ahead the CPU should probably also not be added
> >>>>>> to the default groups mask if a failure occurred.
> >>>>>
> >>>>> Existing code doesn't do anything for the case where a CPU
> >>>>> can't be added to a domain (probably the only real error case
> >>>>> is failure to allocate memory for the domain structure).
> >>>>
> >>>> Is my statement about CPU being added to default group mask
> >>>> incorrect? Seems like a potential issue related to domain's
> >>>> CPU mask also.
> >>>>
> >>>> Please see my earlier question. Existing code does not proceed
> >>>> with monitor initialization if control initialization fails and
> >>>> undoes control initialization if monitor initialization fails.
> >>>
> >>> Existing code silently continues if a domain structure cannot
> >>> be allocated to add a CPU to a domain:
> >>>
> >>> 503 static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >>> 504 {
> >>> 505 int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> >>> 506 struct list_head *add_pos = NULL;
> >>> 507 struct rdt_hw_domain *hw_dom;
> >>> 508 struct rdt_domain *d;
> >>> 509 int err;
> >>>
> >>> ...
> >>>
> >>> 523
> >>> 524 hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> >>> 525 if (!hw_dom)
> >>> 526 return;
> >>> 527
> >>>
> >>
> >>
> >> Right ... and if it returns silently as above it runs:
> >>
> >> static int resctrl_online_cpu(unsigned int cpu)
> >> {
> >>
> >>
> >> for_each_capable_rdt_resource(r)
> >> domain_add_cpu(cpu, r);
> >> >>>>> cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); <<<<<<<<
> >>
> >> }
> >>
> >> Also, note within domain_add_cpu():
> >>
> >> static void domain_add_cpu(int cpu, struct rdt_resource *r)
> >> {
> >>
> >>
> >> ...
> >> if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> >> domain_free(hw_dom);
> >> return;
> >> }
> >>
> >> if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> >> domain_free(hw_dom);
> >> return;
> >> }
> >>
> >> ...
> >> }
> >>
> >> The above is the other item that I've been trying to discuss
> >> with you. Note that existing resctrl will not initialize monitoring if
> >> control could not be initialized.
> >> Compare with this submission:
> >>
> >> if (r->alloc_capable)
> >> domain_add_cpu_ctrl(cpu, r);
> >> if (r->mon_capable)
> >> domain_add_cpu_mon(cpu, r);
> >>
> >>
>
> I'll stop trying

I'll see what I can do to improve the error handling.

-Tony