Re: [PATCH] perf_events: fix broken intr throttling (v3)

From: Stephane Eranian
Date: Wed Feb 08 2012 - 06:43:44 EST


On Wed, Feb 8, 2012 at 12:32 PM, Anton Blanchard <anton@xxxxxxxxx> wrote:
>
> Hi,
>
> On Thu, 26 Jan 2012 17:03:19 +0100
> Stephane Eranian <eranian@xxxxxxxxxx> wrote:
>
>> This patch fixes the throttling mechanism. It was broken
>> in 3.2.0. Events were not being unthrottled. The unthrottling
>> mechanism required that events be checked at each timer tick.
>
> This patch breaks perf on POWER. I haven't had a chance to work out why
> yet, but will investigate tomorrow.
>
Did you apply the subsequent patch I posted yesterday?

https://lkml.org/lkml/2012/2/7/185

> Anton
>
>> This patch solves this problem and also separates:
>>  - unthrottling
>>  - multiplexing
>>  - frequency-mode period adjustments
>>
>> Not all of them need to be executed at each timer tick.
>>
>> This third version of the patch is based on my original patch +
>> PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).
>>
>> At each timer tick, for each context:
>>  - if the current CPU has throttled events, we unthrottle events
>>
>>  - if context has frequency-based events, we adjust sampling periods
>>
>>  - if we have reached the jiffies interval, we multiplex (rotate)
>>
>> We decoupled rotation (multiplexing) from frequency-mode sampling
>> period adjustments. They need not happen at the same
>> rate. Multiplexing is subject to jiffies_interval (currently at 1
>> but could be higher once the tunable is exposed via sysfs).
>>
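>> Schematically, the split can be modeled with a tiny standalone C program
>> (purely illustrative, not kernel code; the jiffies_interval value of 4 is
>> an arbitrary example):
>>
>> #include <stdio.h>
>>
>> int main(void)
>> {
>> 	const unsigned long jiffies_interval = 4;	/* example tunable value */
>> 	unsigned long jiffies;
>>
>> 	for (jiffies = 1; jiffies <= 8; jiffies++) {
>> 		/* every tick: unthrottling and frequency-mode adjustment */
>> 		printf("tick %lu: unthrottle + freq adjust", jiffies);
>>
>> 		/* multiplexing only on the jiffies_interval boundary */
>> 		if (jiffies_interval == 1 || !(jiffies % jiffies_interval))
>> 			printf(" + rotate");
>> 		printf("\n");
>> 	}
>> 	return 0;
>> }
>>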
>> We have grouped frequency-mode adjustment and unthrottling into the
>> same routine to minimize code duplication. When throttled while in
>> frequency mode, we scan the events only once.
>>
>> We have fixed the threshold enforcement code in
>> __perf_event_overflow(). There was a bug whereby it would allow more
>> than the authorized rate because an increment of hwc->interrupts was
>> not executed at the right place.
>>
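>> The new accounting can be modeled in plain C as follows (a simplified,
>> illustrative userspace model, not the kernel code; the limit of 2000 per
>> tick is just an example value). The point is that hwc->interrupts is no
>> longer cleared by the tick itself: each event compares its own
>> interrupts_seq against a per-CPU sequence number that the tick increments,
>> so the count restarts implicitly on the first overflow after a tick:
>>
>> #include <stdio.h>
>>
>> #define MAX_SAMPLES_PER_TICK	2000
>> #define MAX_INTERRUPTS		(~0U)
>>
>> static unsigned long perf_throttled_seq;	/* bumped once per timer tick */
>>
>> struct hw_perf_event {
>> 	unsigned long	interrupts_seq;
>> 	unsigned int	interrupts;
>> };
>>
>> /* model of the overflow path: returns 1 when the event must be throttled */
>> static int overflow(struct hw_perf_event *hwc)
>> {
>> 	if (hwc->interrupts_seq != perf_throttled_seq) {
>> 		/* first overflow since the last tick: restart the count */
>> 		hwc->interrupts_seq = perf_throttled_seq;
>> 		hwc->interrupts = 1;
>> 	} else if (++hwc->interrupts >= MAX_SAMPLES_PER_TICK) {
>> 		hwc->interrupts = MAX_INTERRUPTS;
>> 		return 1;
>> 	}
>> 	return 0;
>> }
>>
>> int main(void)
>> {
>> 	struct hw_perf_event ev = { 0, 0 };
>> 	unsigned long tick, samples = 0;
>>
>> 	for (tick = 0; tick < 10; tick++) {
>> 		perf_throttled_seq++;	/* done by perf_event_task_tick() */
>>
>> 		/* the PMU tries to sample far more often than allowed */
>> 		for (;;) {
>> 			int stop = overflow(&ev);
>>
>> 			samples++;	/* the sample tripping the limit is still logged */
>> 			if (stop)
>> 				break;	/* event stays stopped until the next tick */
>> 		}
>> 	}
>> 	printf("%lu samples over 10 ticks (limit %d per tick)\n",
>> 	       samples, MAX_SAMPLES_PER_TICK);
>> 	return 0;
>> }
>>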
>> The patch was tested with a low sampling limit (2000), with both fixed
>> periods and frequency mode, and with an overcommitted PMU.
>>
>> On a 2.1GHz AMD CPU:
>> $ cat /proc/sys/kernel/perf_event_max_sample_rate
>> 2000
>>
>> We set a rate of 3000 samples/sec (2.1GHz/3000 = 700000):
>>
>> $ perf record -e cycles,cycles -c 700000 noploop 10
>> $ perf report -D | tail -21
>> Aggregated stats:
>>      TOTAL events:      80086
>>       MMAP events:         88
>>       COMM events:          2
>>       EXIT events:          4
>>   THROTTLE events:      19996
>> UNTHROTTLE events:      19996
>>     SAMPLE events:      40000
>> cycles stats:
>>      TOTAL events:      40006
>>       MMAP events:          5
>>       COMM events:          1
>>       EXIT events:          4
>>   THROTTLE events:       9998
>> UNTHROTTLE events:       9998
>>     SAMPLE events:      20000
>> cycles stats:
>>      TOTAL events:      39996
>>   THROTTLE events:       9998
>> UNTHROTTLE events:       9998
>>     SAMPLE events:      20000
>>
>> For 10s, the cap is 2x2000x10 = 40000 samples.
>> We get exactly that: 20000 samples/event.
>>
>> Signed-off-by: Stephane Eranian <eranian@xxxxxxxxxx>
>> ---
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 0b91db2..412b790 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -589,6 +589,7 @@ struct hw_perf_event {
>>  	u64				sample_period;
>>  	u64				last_period;
>>  	local64_t			period_left;
>> +	u64				interrupts_seq;
>>  	u64				interrupts;
>>
>>  	u64				freq_time_stamp;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index de859fb..7c3b9de 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -2300,6 +2300,9 @@ do {					\
>>  	return div64_u64(dividend, divisor);
>>  }
>>
>> +static DEFINE_PER_CPU(int, perf_throttled_count);
>> +static DEFINE_PER_CPU(u64, perf_throttled_seq);
>> +
>>  static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count)
>>  {
>>  	struct hw_perf_event *hwc = &event->hw;
>> @@ -2325,16 +2328,29 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count)
>>  	}
>>  }
>>
>> -static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
>> +/*
>> + * combine freq adjustment with unthrottling to avoid two passes over the
>> + * events. At the same time, make sure, having freq events does not change
>> + * the rate of unthrottling as that would introduce bias.
>> + */
>> +static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
>> +					   int needs_unthr)
>>  {
>>  	struct perf_event *event;
>>  	struct hw_perf_event *hwc;
>> -	u64 interrupts, now;
>> +	u64 now, period = TICK_NSEC;
>>  	s64 delta;
>>
>> -	if (!ctx->nr_freq)
>> +	/*
>> +	 * only need to iterate over all events iff:
>> +	 * - context have events in frequency mode (needs freq adjust)
>> +	 * - there are events to unthrottle on this cpu
>> +	 */
>> +	if (!(ctx->nr_freq || needs_unthr))
>>  		return;
>>
>> +	raw_spin_lock(&ctx->lock);
>> +
>>  	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
>>  		if (event->state != PERF_EVENT_STATE_ACTIVE)
>>  			continue;
>> @@ -2344,13 +2360,8 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
>>  		hwc = &event->hw;
>>
>> -		interrupts = hwc->interrupts;
>> -		hwc->interrupts = 0;
>> -
>> -		/*
>> -		 * unthrottle events on the tick
>> -		 */
>> -		if (interrupts == MAX_INTERRUPTS) {
>> +		if (needs_unthr && hwc->interrupts == MAX_INTERRUPTS) {
>> +			hwc->interrupts = 0;
>>  			perf_log_throttle(event, 1);
>>  			event->pmu->start(event, 0);
>>  		}
>> @@ -2358,14 +2369,26 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
>>  		if (!event->attr.freq || !event->attr.sample_freq)
>>  			continue;
>>
>> -		event->pmu->read(event);
>> +		/*
>> +		 * stop the event and update event->count
>> +		 */
>> +		event->pmu->stop(event, PERF_EF_UPDATE);
>> +
>>  		now = local64_read(&event->count);
>>  		delta = now - hwc->freq_count_stamp;
>>  		hwc->freq_count_stamp = now;
>>
>> +		/*
>> +		 * restart the event
>> +		 * reload only if value has changed
>> +		 */
>>  		if (delta > 0)
>>  			perf_adjust_period(event, period, delta);
>> +
>> +		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
>>  	}
>> +
>> +	raw_spin_unlock(&ctx->lock);
>>  }
>>
>>  /*
>> @@ -2388,16 +2411,13 @@ static void rotate_ctx(struct perf_event_context *ctx)
>>   */
>>  static void perf_rotate_context(struct perf_cpu_context *cpuctx)
>>  {
>> -	u64 interval = (u64)cpuctx->jiffies_interval * TICK_NSEC;
>>  	struct perf_event_context *ctx = NULL;
>> -	int rotate = 0, remove = 1, freq = 0;
>> +	int rotate = 0, remove = 1;
>>
>>  	if (cpuctx->ctx.nr_events) {
>>  		remove = 0;
>>  		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
>>  			rotate = 1;
>> -		if (cpuctx->ctx.nr_freq)
>> -			freq = 1;
>>  	}
>>
>>  	ctx = cpuctx->task_ctx;
>> @@ -2405,37 +2425,26 @@ static void perf_rotate_context(struct perf_cpu_context *cpuctx)
>>  		remove = 0;
>>  		if (ctx->nr_events != ctx->nr_active)
>>  			rotate = 1;
>> -		if (ctx->nr_freq)
>> -			freq = 1;
>>  	}
>>
>> -	if (!rotate && !freq)
>> +	if (!rotate)
>>  		goto done;
>>
>>  	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>  	perf_pmu_disable(cpuctx->ctx.pmu);
>>
>> -	if (freq) {
>> -		perf_ctx_adjust_freq(&cpuctx->ctx, interval);
>> -		if (ctx)
>> -			perf_ctx_adjust_freq(ctx, interval);
>> -	}
>> -
>> -	if (rotate) {
>> -		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>> -		if (ctx)
>> -			ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
>> +	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>> +	if (ctx)
>> +		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
>>
>> -		rotate_ctx(&cpuctx->ctx);
>> -		if (ctx)
>> -			rotate_ctx(ctx);
>> +	rotate_ctx(&cpuctx->ctx);
>> +	if (ctx)
>> +		rotate_ctx(ctx);
>>
>> -		perf_event_sched_in(cpuctx, ctx, current);
>> -	}
>> +	perf_event_sched_in(cpuctx, ctx, current);
>>
>>  	perf_pmu_enable(cpuctx->ctx.pmu);
>>  	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> -
>>  done:
>>  	if (remove)
>>  		list_del_init(&cpuctx->rotation_list);
>> @@ -2445,10 +2454,22 @@ void perf_event_task_tick(void)
>>  {
>>  	struct list_head *head = &__get_cpu_var(rotation_list);
>>  	struct perf_cpu_context *cpuctx, *tmp;
>> +	struct perf_event_context *ctx;
>> +	int throttled;
>>
>>  	WARN_ON(!irqs_disabled());
>>
>> +	__this_cpu_inc(perf_throttled_seq);
>> +	throttled = __this_cpu_xchg(perf_throttled_count, 0);
>> +
>>  	list_for_each_entry_safe(cpuctx, tmp, head, rotation_list) {
>> +		ctx = &cpuctx->ctx;
>> +		perf_adjust_freq_unthr_context(ctx, throttled);
>> +
>> +		ctx = cpuctx->task_ctx;
>> +		if (ctx)
>> +			perf_adjust_freq_unthr_context(ctx, throttled);
>> +
>>  		if (cpuctx->jiffies_interval == 1 ||
>>  				!(jiffies % cpuctx->jiffies_interval))
>>  			perf_rotate_context(cpuctx);
>> @@ -4514,6 +4535,7 @@ static int __perf_event_overflow(struct perf_event *event,
>>  {
>>  	int events = atomic_read(&event->event_limit);
>>  	struct hw_perf_event *hwc = &event->hw;
>> +	u64 seq;
>>  	int ret = 0;
>>
>>  	/*
>> @@ -4523,14 +4545,20 @@ static int __perf_event_overflow(struct perf_event *event,
>>  	if (unlikely(!is_sampling_event(event)))
>>  		return 0;
>>
>> -	if (unlikely(hwc->interrupts >= max_samples_per_tick)) {
>> -		if (throttle) {
>> +	seq = __this_cpu_read(perf_throttled_seq);
>> +	if (seq != hwc->interrupts_seq) {
>> +		hwc->interrupts_seq = seq;
>> +		hwc->interrupts = 1;
>> +	} else {
>> +		hwc->interrupts++;
>> +		if (unlikely(throttle
>> +			     && hwc->interrupts >= max_samples_per_tick)) {
>> +			__this_cpu_inc(perf_throttled_count);
>>  			hwc->interrupts = MAX_INTERRUPTS;
>>  			perf_log_throttle(event, 0);
>>  			ret = 1;
>>  		}
>> -	} else
>> -		hwc->interrupts++;
>> +	}
>>
>>  	if (event->attr.freq) {
>>  		u64 now = perf_clock();
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/