Re: [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR

From: Sohil Mehta
Date: Thu Jul 28 2022 - 14:46:43 EST


Some nits below:

On 7/23/2022 1:37 AM, Yi Sun wrote:
Calculate the latency of instructions xsave and xrstor with new trace
points x86_fpu_latency_xsave and x86_fpu_latency_xrstor.

The delta TSC can be calculated within a single trace event.

s/can be/is

Another
option considered was to have 2 separated trace events marking the
start and finish of the xsave/xrstor instructions. The delta TSC was
calculated from the 2 trace points in user space, but there was
significant overhead added by the trace function itself.


...


To enable it, CONFIG_X86_DEBUG_FPU and CONFIG_TRACEPOINTS are required.
The compiler can get rid of all the extra crust when either of the two

s/can/will


configs is disabled.


If both of the two configs are enabled, xsave/xrstor_tracing_enabled

s/two/

would be reduced to a static check for tracing enabled. Thus, in the
fast path there would be only 2 additional static checks.

...

Leave the noise here instead of additional
conditions before calling the x86_fpu_latency_* because that makes the
latency more accurate and it's easy to filer the noise out by the

s/filer/filter

following consuming script.

...

+DECLARE_EVENT_CLASS(x86_fpu_latency,
+ TP_PROTO(struct fpstate *fpstate, u64 dtsc),

On x86, though TSC is the underlying counter for trace_clock(), it might be useful to keep this generic.

s/dtsc/latency

"latency" would match the printk below as well.

+ TP_ARGS(fpstate, dtsc),
+
+ TP_STRUCT__entry(
+ __field(struct fpstate *, fpstate)
+ __field(u64, dtsc)
+ __field(u64, rfbm)
+ __field(u64, xinuse)
+ ),
+
+ TP_fast_assign(
+ __entry->fpstate = fpstate;
+ __entry->dtsc = dtsc;
+ __entry->rfbm = fpstate->xfeatures;
+ __entry->xinuse = fpstate->regs.xsave.header.xfeatures;
+ ),
+
+ TP_printk("x86/fpu: latency:%lld RFBM:0x%llx XINUSE:0x%llx",
+ __entry->dtsc,
+ __entry->rfbm,
+ __entry->xinuse
+ )
+);
+

-Sohil