Re: [PATCH v8 13/24] x86/resctrl: Queue mon_event_read() instead of sending an IPI

From: Moger, Babu
Date: Wed Jan 03 2024 - 14:44:07 EST




On 12/15/23 11:43, James Morse wrote:
> Intel is blessed with an abundance of monitors, one per RMID, that can be
> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
> the number implemented is up to the manufacturer. This means when there are
> fewer monitors than needed, they need to be allocated and freed.
>
> MPAM's CSU monitors are used to back the 'llc_occupancy' monitor file. The
> CSU counter is allowed to return 'not ready' for a small number of
> micro-seconds after programming. To allow one CSU hardware monitor to be
> used for multiple control or monitor groups, the CPU accessing the
> monitor needs to be able to block when configuring and reading the
> counter.
>
> Worse, the domain may be broken up into slices, and the MMIO accesses
> for each slice may need performing from different CPUs.
>
> These two details mean MPAMs monitor code needs to be able to sleep, and
> IPI another CPU in the domain to read from a resource that has been sliced.
>
> mon_event_read() already invokes mon_event_count() via IPI, which means
> this isn't possible. On systems using nohz-full, some CPUs need to be
> interrupted to run kernel work as they otherwise stay in user-space
> running realtime workloads. Interrupting these CPUs should be avoided,
> and scheduling work on them may never complete.
>
> Change mon_event_read() to pick a housekeeping CPU, (one that is not using
> nohz_full) and schedule mon_event_count() and wait. If all the CPUs
> in a domain are using nohz-full, then an IPI is used as the fallback.
>
> This function is only used in response to a user-space filesystem request
> (not the timing sensitive overflow code).
>
> This allows MPAM to hide the slice behaviour from resctrl, and to keep
> the monitor-allocation in monitor.c. When the IPI fallback is used on
> machines where MPAM needs to make an access on multiple CPUs, the counter
> read will always fail.
>
> Signed-off-by: James Morse <james.morse@xxxxxxx>
> Tested-by: Shaopeng Tan <tan.shaopeng@xxxxxxxxxxx>
> Tested-by: Peter Newman <peternewman@xxxxxxxxxx>
> Tested-by: Babu Moger <babu.moger@xxxxxxx>
> Reviewed-by: Shaopeng Tan <tan.shaopeng@xxxxxxxxxxx>
> Reviewed-by: Peter Newman <peternewman@xxxxxxxxxx>
> Reviewed-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>

Reviewed-by: Babu Moger <babu.moger@xxxxxxx>

> ---
> Changes since v2:
> * Use cpumask_any_housekeeping() and fallback to an IPI if needed.
>
> Changes since v3:
> * Actually include the IPI fallback code.
>
> Changes since v4:
> * Tinkered with existing capitalisation.
>
> Changes since v5:
> * Added a newline.
>
> Changes since v6:
> * Moved lockdep annotations to a later patch.
> ---
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 26 +++++++++++++++++++++--
> arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
> 2 files changed, 25 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index beccb0e87ba7..d07f99245851 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -19,6 +19,8 @@
> #include <linux/kernfs.h>
> #include <linux/seq_file.h>
> #include <linux/slab.h>
> +#include <linux/tick.h>
> +
> #include "internal.h"
>
> /*
> @@ -522,12 +524,21 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
> return ret;
> }
>
> +static int smp_mon_event_count(void *arg)
> +{
> + mon_event_count(arg);
> +
> + return 0;
> +}
> +
> void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> struct rdt_domain *d, struct rdtgroup *rdtgrp,
> int evtid, int first)
> {
> + int cpu;
> +
> /*
> - * setup the parameters to send to the IPI to read the data.
> + * Setup the parameters to pass to mon_event_count() to read the data.
> */
> rr->rgrp = rdtgrp;
> rr->evtid = evtid;
> @@ -536,7 +547,18 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> rr->val = 0;
> rr->first = first;
>
> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> + cpu = cpumask_any_housekeeping(&d->cpu_mask);
> +
> + /*
> + * cpumask_any_housekeeping() prefers housekeeping CPUs, but
> + * are all the CPUs nohz_full? If yes, pick a CPU to IPI.
> + * MPAM's resctrl_arch_rmid_read() is unable to read the
> + * counters on some platforms if its called in irq context.
> + */
> + if (tick_nohz_full_cpu(cpu))
> + smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> + else
> + smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
> }
>
> int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 8737a9d6caef..7e81268137b0 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -588,7 +588,7 @@ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
> }
>
> /*
> - * This is called via IPI to read the CQM/MBM counters
> + * This is scheduled by mon_event_read() to read the CQM/MBM counters
> * on a domain.
> */
> void mon_event_count(void *info)

--
Thanks
Babu Moger