Re: [PATCH 2/2] x86/resctrl: Add tracepoint for llc_occupancy tracking

From: Haifeng Xu
Date: Sun Mar 03 2024 - 22:08:16 EST




On 2024/3/2 01:47, James Morse wrote:
> Hello!
>
> On 29/02/2024 07:11, Haifeng Xu wrote:
>> In our production environment, after removing monitor groups, those unused
>> RMIDs get stuck in the limbo list forever because their llc_occupancy are
>> always larger than the threshold. But the unused RMIDs can be successfully
>> freed by turning up the threshold.
>>
>> In order to know how much the threshold should be, perf can be used to acquire
>> the llc_occupancy of RMIDs in each rdt domain.
>>
>> Instead of using perf tool to track llc_occupancy and filter the log manually,
>> it is more convenient for users to use tracepoint to do this work. So add a new
>> tracepoint that shows the llc_occupancy of busy RMIDs when scanning the limbo
>> list.
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index c34a35ec0f03..ada392ca75b2 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -362,6 +363,13 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>
> [expanded the diff hunk for better context]
>> entry = __rmid_entry(idx);
>> if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
>> QOS_L3_OCCUP_EVENT_ID, &val,
>> arch_mon_ctx)) {
>> rmid_dirty = true;
>> } else {
>> rmid_dirty = (val >= resctrl_rmid_realloc_threshold);
>> }
>>
>> if (force_free || !rmid_dirty) {
>> clear_bit(idx, d->rmid_busy_llc);
>> if (!--entry->busy)
>> limbo_release_entry(entry);
>> }
>
>> cur_idx = idx + 1;
>
> Ideally this would be the last line in the loop, its how the iterator advances to the next
> bit in the bitmap.
>
>
>> + /* x86's CLOSID and RMID are independent numbers, so the entry's
>> + * closid is a invalid CLOSID. But on arm64, the RMID value isn't
>> + * a unique number for each CLOSID. It's necessary to track both
>> + * CLOSID and RMID because there may be dependencies between each
>> + * other on some architectures */
>> + trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->id, val);
>
> I agree outputting both these values is the right thing to do.
>
> resctrl_arch_rmid_read() could return an error, in which case val here is either
> uninitialised, or the value of another RMID.
> Could you put the tracepoint in the 'else' section of the if/else after
> resctrl_arch_rmid_read()? This way it will only output a value to user-space if it is correct.
>
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/trace.h b/arch/x86/kernel/cpu/resctrl/trace.h
>> index 495fb90c8572..35149a75c951 100644
>> --- a/arch/x86/kernel/cpu/resctrl/trace.h
>> +++ b/arch/x86/kernel/cpu/resctrl/trace.h
>> @@ -35,6 +35,21 @@ TRACE_EVENT(pseudo_lock_l3,
>> TP_printk("hits=%llu miss=%llu",
>> __entry->l3_hits, __entry->l3_miss));
>>
>> +TRACE_EVENT(mon_llc_occupancy_limbo,
>> + TP_PROTO(u32 ctrl_hw_id, u32 mon_hw_id, int id, u64 occupancy),
>> + TP_ARGS(ctrl_hw_id, mon_hw_id, id, occupancy),
>> + TP_STRUCT__entry(__field(u32, ctrl_hw_id)
>> + __field(u32, mon_hw_id)
>
>> + __field(int, id)
>
> Nit: Could we call this 'domain_id'? We've got two other ids already, so we should be
> clear what each one is!
>
>
>> + __field(u64, occupancy)),
>
> Nit: 'occupancy_bytes'? Just to avoid user-space thinking it needs to convert from the
> hardware unit in Intel's SDM ... and just in case we ever have to add another parameter
> that is sort of occupancy too.
>
>
>> + TP_fast_assign(__entry->ctrl_hw_id = ctrl_hw_id;
>> + __entry->mon_hw_id = mon_hw_id;
>> + __entry->id = id;
>> + __entry->occupancy = occupancy;),
>> + TP_printk("ctrl_hw_id=%u mon_hw_id=%u domain=%d llc_occupancy=%llu",
>> + __entry->ctrl_hw_id, __entry->mon_hw_id, __entry->id, __entry->occupancy)
>> + );
>> +
>
> Tracepoints always expose some implicit details of the kernel internals which can make
> supporting them a headache. In this case - I've had some discussion with folk about future
> work to defer the limbo work as late as possible - until a new control or monitor group is
> allocated. This would be to avoid interrupting user-space tasks to update the limbo list
> when the result isn't needed until alloc time.
>
> In this case the tracepoint wouldn't be hit unless user-space made a mkdir() syscall to
> force an update - I think this can just be a documentation problem.
>
> I also don't think we should commit to this outputting values for all dirty RMID - which
> it does today. If group creation were to fail because there weren't any free RMID you'd
> get all the values, I think this is the only case where user-space would care.

Yes.

>
> Could we document the properties of the the trace-point that can be relied on in
> Documentation/arch/x86/resctrl.rst ?
>
> Something like:
> | This tracepoint gives you the precise occupancy values for a subset of RMID that are not
> | immediately available for allocation. This can't be relied on to produce output every
> | second, it may be necessary to attempt to create an empty monitor group to force an
> | update. Output may only be produced if creation of a control or monitor group fails.
>
> I think we'll always walk a list of dirty RMID before failing to allocate an RMID, so this
> should be future proof.
>
>
> With the val and documentation bits:
> Reviewed-by: James Morse <james.morse@xxxxxxx>
>
>
> Thanks,
>
> James

Thanks for your suggestions.