RE: [PATCH v3 1/2] perf,x86: add Intel Memory Bandwidth Monitoring (MBM) PMU

From: Juvva, Kanaka D
Date: Tue Sep 08 2015 - 13:06:22 EST


Hi Matt,

Here is a response to all your comments down below.

Thanks,
-Kanaka

> -----Original Message-----
> From: Fleming, Matt
> Sent: Tuesday, September 8, 2015 4:47 AM
> To: Juvva, Kanaka D
> Cc: Thomas Gleixner; Kanaka Juvva; Williamson, Glenn P; Auld, Will; Andi Kleen;
> LKML; Luck, Tony; Peter Zijlstra; Tejun Heo; x86@xxxxxxxxxx; Ingo Molnar; H.
> Peter Anvin; Shivappa, Vikas
> Subject: Re: [PATCH v3 1/2] perf,x86: add Intel Memory Bandwidth Monitoring
> (MBM) PMU
>
> On Mon, 2015-09-07 at 20:22 +0100, Juvva, Kanaka D wrote:
> > Hi Thomas,
> >
> > I'm sending updated patch(s). I have given details for each of
> > these items below.
> >
>
> Kanaka, this email is HTML formatted and so has been blocked by
> vger.kernel.org where the linux-kernel mailing list is hosted.
>
> Please configure outlook not to send html email, or use a different mail agent
> for working with upstream.
>

Done.

> > Regards,
> > -Kanaka
> >
> > > -----Original Message-----
> > > From: Thomas Gleixner [mailto:tglx@xxxxxxxxxxxxx]
> > > Sent: Wednesday, August 19, 2015 1:50 PM
> > > To: Kanaka Juvva
> > > Cc: Juvva, Kanaka D; Williamson, Glenn P; Fleming, Matt; Auld, Will;
> > Andi Kleen;
> > > LKML; Luck, Tony; Peter Zijlstra; Tejun Heo; x86@xxxxxxxxxx; Ingo
> > Molnar; H.
> > > Peter Anvin; Shivappa, Vikas
> > > Subject: Re: [PATCH v3 1/2] perf,x86: add Intel Memory Bandwidth
> > Monitoring
> > > (MBM) PMU
> > >
> > > On Fri, 7 Aug 2015, Kanaka Juvva wrote:
> > > > +#define MBM_CNTR_MAX 0xffffff
> > > > +#define MBM_SOCKET_MAX 8
> > > > +#define MBM_TIME_DELTA_MAX 1000
> > > > +#define MBM_TIME_DELTA_MIN 100
> > >
> > > What are these constants for and how are they determined? Pulled out
> > > of thin air?
> > >
> >
> > /*
> > * MBM Counter is 24bits wide. MBM_CNTR_MAX defines max counter
> > * value
> > */
> > #define MBM_CNTR_MAX 0xffffff
> > /*
> > * Max #sockets supported
> > */
> > #define MBM_SOCKET_MAX 8
>
> This seems like a constant we could get by without. Do we really need to know
> this at compile time?
>

I spoke to Andi Kleen regarding how many sockets. We decided we could keep 8.


> > /*
> > * Expected time interval between consecutive MSR reads for a given rmid
> > */
> > #define MBM_TIME_DELTA_MAX 1000
>
> "max" and "expected" are not the same thing.
>
>

This is being changed to #define MBM_TIME_DELTA_EXP 1000

> > > > #define QOS_L3_OCCUP_EVENT_ID (1 << 0)
> > > > +#define QOS_MBM_TOTAL_EVENT_ID (1 << 1)
> > > > +#define QOS_MBM_LOCAL_EVENT_ID_HW 0x3
> > > > +#define QOS_MBM_LOCAL_EVENT_ID (1 << 2)
> > >
> > > So we have ID values which are built with (1 << X) and then this HW
> > > variant in the middle with 0x3. Of course without any explanation what the
> heck this stuff is.
> > >
> > > Last review:
> > >
> > > "So this wants a descriptive ID name and a comment."
> > >
> > >
> >
> > /*
> > * MBM Event IDs as defined in SDM section 17.14.6
> > * Event IDs used to program MSRs for reading counters
> > */
> > #define QOS_MBM_TOTAL_EVENT_ID (1 << 1)
> > #define QOS_MBM_LOCAL_EVENT_ID_HW 0x3
> > /*
> > * Perf needs event id to be 1 << x, hence we can't use 0x3 (HW EVENT ID)
> > * for MBM_LOCAL_EVENT we use next 1 << x for MBM_LOCAL_EVENT_ID
> > */
> > #define QOS_MBM_LOCAL_EVENT_ID (1 << 2)
>
> No, perf events do not need to be of the form (1 << X), that was just a
> convention we used in the cqm code before we knew what values the MBM
> events would take - you can change these to be whatever format you want, but
> be sure to make it consistent.
>
> The constants are very much supposed to be programmed into the MSRs, take a
> look at __rmid_read().
>
> I would suggest (as I already did privately) that you change the format to be
> 0x0x for all of these event IDs.
>
>

There are two aspects:

1) Programming MSRs
2) EVENT_ATTR_STR(llc_local_bw, intel_cqm_llc_local_bw, "event=0x04");

1 is used for programming MSRs
2 event attribute for perf


For MBM_LOCAL_EVENT HW ID is 0x3. We don't want to use 0x3 for EVENT ATTR.

If we use 0x3 for event_attribute

We can't clearly distinguish whether is EVENT 01 & EVENT 02 or EVENT 03 alone.
For perf event attribute it has to be 0x04. Because 0x01 and 0x02 are used for other two events

> > > > @@ static bool intel_cqm_sched_in_event(u32 rmid)
> > > > return false;
> > > > }
> > > >
> > > > +
> > > > +static u32 bw_sum_calc(struct sample *bw_stat, int rmid) {
> > > > + u32 val = 0, i, j, index;
> > > > +
> > > > + if (++bw_stat->fifoout >= mbm_window_size)
> > > > + bw_stat->fifoout = 0;
> > > > + index = bw_stat->fifoout;
> > > > + for (i = 0; i < mbm_window_size - 1; i++) {
> > > > + if (index + i >= mbm_window_size)
> > > > + j = index + i - mbm_window_size;
> > > > + else
> > > > + j = index + i;
> > > > + val += bw_stat->mbmfifo[j];
> > > > + }
> > >
> > > This math wants a explanatory comment.
> > >
> > /*
> > * Slide the window by 1 and calculate the sum of the last
> > * mbm_window_size-1 bandwidth values.
> > * fifoout is the current position of the window.
> > * Increment the fifoout by 1 to slide the window by 1.
> > *
> > * Calcalute the bandwidth from ++fifiout to ( ++fifoout + mbm_window_size -
> 1)
> > * e.g.fifoout =1; Bandwidth1 Bandwidth2 ..... Bandwidthn are the
> > * sliding window values where n is size of the sliding window
> > * bandwidth sum: val = Bandwidth2 + Bandwidth3 + .. Bandwidthn
> > */
>
> Instead of these large comment blocks please comment smaller, logically-
> connected chunks of code, e.g.
>

OK


> /* Slide the window by one */
> if (++bw_stat->fifoout >= mbm_window_size)
> bw_stat->fifoout = 0;
>
> /*
> * Calculate the sum of last mbm_window_size-1 values.
> */
> for (i = 0; i < mbm_window_size - 1; i++) {
> /* Handle wraparound at end of window */
> if (index + i >= mbm_window_size)
> j = index + i - mbm_window_size;
> else
> j = index + i;
>
> val += bw_stat->mbminfo[j];
> }
>
> >
> >
> > > > + return val;
> > > > +}
> > > > +
> > > > +static u32 __mbm_fifo_sum_lastn_out(int rmid, bool is_localbw) {
> > > > + if (is_localbw)
> > > > + return bw_sum_calc(&mbm_local[rmid], rmid);
> > > > + else
> > > > + return bw_sum_calc(&mbm_total[rmid], rmid); }
> > > > +
> > > > +static void __mbm_fifo_in(struct sample *bw_stat, u32 val) {
> > > > + bw_stat->mbmfifo[bw_stat->fifoin] = val;
> > > > + if (++bw_stat->fifoin >= mbm_window_size)
> > >
> > > How does that become greater than mbm_windowsize?
> > >
> >
> > This is fixed by changing >= to ==
> > Added a comment:
> >
> > /*
> > * store current sample's bw value in sliding window at the
> > * index fifoin. Increment fifoin. Check if fifoin has reached
> > * max_window_size. If yes reset it to begining i.e. zero
> > * e.g.
> > * mbm_window_size = 10
> > * mbmfifo is a circular fifo 0 1 2 3 4 5 6 7 8 9 10
> > * ^ |
> > * | |
> > * | _ _ _ _ _ _ _ _ _ _|
> > *
> > * So when fifoin becomes 10, then it is reset to zero
> > *
> > */
>
> I'm not sure that this comment adds anything of value that isn't already
> understood by reading the code. I don't think you need a comment for this
> function, it seems pretty straight forward and Thomas' question was about the
> boundary limits of ->fifoin.
>

Thomas question was addressed by the change in if statement i.e '==''
I'll take out the comment.

> > > > + bw_stat->fifoin = 0;
> > > > +}
> > >
> > > > +/*
> > > > + * __rmid_read_mbm checks whether it is LOCAL or GLOBAL MBM event
> > > > +and reads
> > > > + * its MSR counter. Check whether overflow occurred and handles it.
> > > > +Calculates
> > > > + * currenet BW and updates running average.
> > >
> > > currenet? And please get rid of the double spaces
> > >
> > This is fixed now. Here is the updated comment:
> >
> > /*
> > * rmid_read_mbm checks whether it is LOCAL or Total MBM event and reads
> > * its MSR counter. Check whether overflow occured and handles it. Calculates
> > * currenet BW and updates running average.
> > *
>
> ^^^ You've still misspelled current.

Fixed

> >
> > * Overflow Handling:
> > * if (MSR current value < MSR previous value) it is an
> > * overflow. MSR values are increasing when bandwidth consumption for the
> thread
> > * is non-zero; When MSR values reaches MAX_COUNTER_VALUE it overflows.
> After overflow,
> > * MSR current value goes back to zero and starts increasing again at the rate
> of
> > * bandwidth.
> > *
>

How overflow handling is down below. I'll change this as per the comments

> You don't need to provide a definition of "overflow", most people will be
> familiar with it. What is more important to document is how the overflow is
> handled...
> >
> > * Overflow handling:
> > * Detect an overflow : current read value > last read value
>
> Isn't this inverted? Overflow occurred if current < previous.
>
> >
> > * Overflow correction: if (overflow)
> > * Current value = (MAX_COUNTER_VALUE - prev read value) +
> current read value
> > * else
> > * Current value = current read value
> > *
>
> Please don't write pseudocode in the comments. Use English prose to describe
> the important parts of the code.
>
OK

> >
> > * Calculation of Current Bandwidth value:
> > * If MSR is read within last 100ms, then then the smaple is ignored;
> > * If the MSR was Read with in last 100ms, why incur an extra overhead
> > * of doing the MSR reads again. Anyway there'll be a negligible change or zero
> > * change in MSR readings in 100ms.
> > *
> > * Bandwidth is calculated as:
> > * memory bandwidth = difference of last two msr counter values/time
> difference.
> > *
> > * cum_avg = Running Average bandwidth of last 'n' bandwidth values for
> > * the samples that are processed
> > *
>
> Where 'n' is 'mbm_window_size' ? If so, please use 'mbm_window_size', not 'n'.

OK


> >
> > * Sliding window is used to save the last 'n' samples. Where,
> > * n = sliding_window_size and results in sliding window duration of 'n' secs.
>
> Hmm... this confuses me a lot. Is 'n' a size or a duration? The two are not the
> same thing.
>
Actually they have same values. Anyways I'll change the comment to talk about
duration. That way it will not be a confusion.

> >
> > * The sliding window size by default set to
> > * MBM_FIFO_SIZE_MIN. User can configure it to the values in the range
> > * (MBM_FIFO_SIZE_MIN,MBM_FIFO_SIZE_MAX). The range for sliding window
> > * is chosen based on a general criteria for monitoring duration. Example
> > * for a short lived application, 10sec monitoring period gives
> > * good characterization of its bandwidth consumption. For an application
> > * that runs for longer duration, 300sec monitoring period gives better
> > * characterization of its bandwidth consumption. Since the running average
> > * calculated for total monitoring period, user gets the most accuracate
> > * average bandwidth for the each monitoring period.
> > *
> > * Scaling:
> > * cum_avg is the raw bandwidth is Bytes/sec.
> > * cum_avg is converted to MB/sec by applying MBM_CONVERSION_FACTOR
> and
> > * rounded to nearest integer. User interface gets the Bandwidth values in
> MB/sec.
> > *
> > */
> > > > + *
> > > > + * Overflow Handling:
> > > > + * if (MSR current value < MSR previous value) it is an
> > > > + * overflow. and overflow is handled.
> > >
> > > Wow. That's informative as hell!
> > >
> > Please look at the modified comment above
> > > > + *
> > > > + * Calculation of Current BW value:
> > >
> > > BW == Body Weight?
> > >
> >
> > It is fixed now
> >
> > > > + * If MSR is read within last 100ms, then the value is ignored;
> > > > + * this will suppress small deltas. We don't process MBM samples
> > > > + that are
> > > > + * within 100ms.
> > >
> > > WHY?
> > >
> > Explained in the comment. If mbm_read is called within in 100ms for
> > the same rmid, we donât have to process the sample.
>
> The key piece of information you're missing here is that skipping these small
> deltas is an optimization, because we avoid performing costly operations for
> what would likely be a very minor change in the MBM data, right?
>
>

Yes, You are correct.

> > > > +{
> > > > + u64 val, tmp, diff_time, cma, bytes, index;
> > > > + bool overflow = false, first = false;
> > > > + ktime_t cur_time;
> > > > + u32 tmp32 = rmid;
> > > > + struct sample *mbm_current;
> > > > + u32 vrmid = topology_physical_package_id(smp_processor_id()) *
> > > > + cqm_max_rmid + rmid;
> > > > +
> > > > + rmid = vrmid;
> > >
> > > From my previous review:
> > >
> > > "This is completely backwards.
> > >
> > > tmp32 = rmid;
> > > rmid = vrmid;
> > > do_stuff(rmid);
> > > rmid = tmp32;
> > > do_other_stuff(rmid);
> > >
> > > Why can't you use vrmid for do_stuff() and leave rmid alone? Just
> > > because it would make the code simpler to read?"
> > >
> > > Still applies.
> > >
> >
> > This is now changed to
> > u64 val, currentmsr, currentbw, diff_time, cma, bytes, index;
> > bool overflow = false, first = false;
> > ktime_t cur_time;
> > u32 tmp32 = rmid, eventid;
> > struct sample *mbm_current;
> > u32 vrmid = rmid_2_index(rmid);
> >
> > rmid = vrmid;
> > cur_time = ktime_get();
> > if (read_mbm_local) {
> > mbm_current = &mbm_local[vrmid];
> > eventid = QOS_MBM_LOCAL_EVENT_ID_HW;
> > wrmsr(MSR_IA32_QM_EVTSEL, QOS_MBM_LOCAL_EVENT_ID_HW,
> > rmid);
>
> You don't need to perform this wrmsr() here because it's taken care of in the
> common code below.
>

OK

> > } else {
> > mbm_current = &mbm_total[vrmid];
> > eventid = QOS_MBM_TOTAL_EVENT_ID;
> > }
> > rmid = tmp32;
>
> Why did you assign rmid to vrmid if you reassign it before it was used?
>
>

For MSR writes we use rmid value and for mbm_* arrary we use vrmid which is actual
index.

> > > > + /* if current msr value < previous msr value , it means overflow */
> > > > + if (val < bytes) {
> > > > + val = MBM_CNTR_MAX - bytes + val;
> > > > + overflow = true;
> > > > + } else
> > > > + val = val - bytes;
> > > > +
> > > > + val = (val * MBM_TIME_DELTA_MAX) / diff_time;
> > > > +
> > > > + if ((diff_time > MBM_TIME_DELTA_MAX) && (!cma))
> > > > + /* First sample */
> > > > + first = true;
> > > > +
> > > > + rmid = vrmid;
> > >
> > > And another time:
> > >
> > > "More obfuscation"
> > >
> >
> > /*
> > * MBM_TIME_DELTA_MAX is picked as per MBM specs. As specified in
> Intel Platform
> > * Quality of Service Monitoring Implementer's Guide V1, Section 2.7.2.
> page 21,
> > * overflow can occur maximum once in a second. So latest we want to
> read the MSR
> > * counters is 1000ms. If it is less than 1000ms we can ignore the sample.
> Then we
> > * decide since when we should ignore. If the MSR was Read with in last
> 100ms, why
> > * process the MSR reads again. Anyway there'll be small change or zero
> change.
> > * So ignoring MSR Reads within 100ms or less is efficient.
> MBM_TIME_DELTA_MIN
> > * is specified as 100ms as per this guideline.
> > *
> > */
>
> I suspect the document you're referring to above is only available under NDA,
> which makes it unsuitable for mention in the kernel source since a large number
> of people won't have access to it.
>
> Just explain that the way the hardware is designed puts an upper limit on how
> quickly the counter can overflow, which is once per second.
>
>

OK. I'll change this to "as per hardware functionality"

> > > > +static void __intel_cqm_event_total_bw_count(void *info) {
> > > > + struct rmid_read *rr = info;
> > > > + u64 val;
> > > > +
> > > > + val = __rmid_read_mbm(rr->rmid, false);
> > > > + if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
> > > > + return;
> > > > + atomic64_add(val, &rr->value); }
> > > > +
> > > > +static void __intel_cqm_event_local_bw_count(void *info) {
> > > > + struct rmid_read *rr = info;
> > > > + u64 val;
> > > > +
> > > > + val = __rmid_read_mbm(rr->rmid, true);
> > > > + if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
> > > > + return;
> > > > + atomic64_add(val, &rr->value); }
> > >
> > > And once more:
> > >
> > > "You're really a fan of copy and paste."
> > >
> >
> > These functions are invoked indirectly. They were written keeping
> intel_cqm_event_count in mind.
> > Iâll change the arg to struct mbm_read{
> > struct rmid_read *rr;
> > u32 eventid;
> > };
> > Intel_cqm_event_*_bw_count(â.) needs eventid to call for decoding
>
> No, please do not duplicate the rmid_read structure, that is not an
> improvement, we don't need two different structs for reading the read data.
>
> Please add the event field to the existing struct rmid_read.
>
>

OK


> > > > @@ -1023,6 +1437,17 @@ static void intel_cqm_event_stop(struct
> > > perf_event *event, int mode)
> > > > } else {
> > > > WARN_ON_ONCE(!state->rmid);
> > > > }
> > > > +
> > > > + if (pmu) {
> > > > + if (pmu->n_active > 0) {
> > >
> > > What's the purpose of this check? In the previous version there was
> > > a WARN_ON(), which made sense. Did it trigger and you decided to "work"
> > > around it?
> > >
> >
> > We actually meant to check if there are active events
>
> I don't follow this answer. Are you saying that the WARN_ON() doesn't make
> sense here?
>
>
>
I can add WARN_ON. But this will always hit if there are no events.
> >
> > > > +EVENT_ATTR_STR(llc_total_bw.unit, intel_cqm_llc_total_bw_unit,
> > > > +"KB/sec"); EVENT_ATTR_STR(llc_local_bw.unit,
> > > > +intel_cqm_llc_local_bw_unit, "KB/sec"); #endif
> > >
> > > > +static ssize_t
> > > > +sliding_window_size_store(struct device *dev,
> > > > + struct device_attribute *attr,
> > > > + const char *buf, size_t count) {
> > > > + unsigned int bytes;
> > > > + int ret;
> > > > +
> > > > + ret = kstrtouint(buf, 0, &bytes);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + mutex_lock(&cache_mutex);
> > > > + if (bytes > 0 && bytes <= MBM_FIFO_SIZE_MAX)
> > > > + mbm_window_size = bytes;
> > >
> > > So, it's valid to set the window to X where 0 < X < MBM_FIFO_SIZE_MIN.
> > > What's the actual purpose of MBM_FIFO_SIZE_MIN?
> > >
> > This is changed to
> > if (bytes >= MBM_FIFO_SIZE_MIN && bytes <= MBM_FIFO_SIZE_MAX)
> > mbm_window_size = bytes;
>
> Note that if the user passes a value outside of this range you should be returning
> -EINVAL to indicate that.
>
>
OK
> > > > + pmu->timer_interval = ms_to_ktime(MBM_TIME_DELTA_MAX);
> > > > + per_cpu(mbm_pmu, cpu) = pmu;
> > > > + per_cpu(mbm_pmu_to_free, cpu) = NULL;
> > >
> > > What's the point of this? If there is still something to be free'd its leaked.
> > > Otherwise that's redundant.
> > per_cpu(mbm_pmu_to_free, cpu) = NULL; is removed
> >
> > > > + mbm_hrtimer_init(pmu);
> > > > + }
> > > > + return 0;
> > >
> > > s/0/NOTIFY_OK/ because you return that value directly.
> > >
> > You mean I return the âreturn codeâ
>
> ?
>
> You should be using NOTIFY_OK here so that you follow the notifier API
> convention.
>
OK

N‹§²æ¸›yú²X¬¶ÇvØ–)Þ{.nlj·¥Š{±‘êX§¶›¡Ü}©ž²ÆzÚj:+v‰¨¾«‘êZ+€Êzf£¢·hšˆ§~†­†Ûÿû®w¥¢¸?™¨è&¢)ßf”ùy§m…á«a¶Úÿ 0¶ìå