Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

From: Jin, Yao
Date: Wed Jun 28 2017 - 01:36:07 EST


Hi Kyle,

I understand your requirement. Sorry I don't know the detail of rr debugger, but I guess if it just uses counter overflow to deliver a signal, could it set the counter without "exclude_kernel"?

Another consideration is, I'm not sure if the kernel can accurately realize that if the interrupt is to trigger sampling or just deliver a signal. Userspace may know that but kernel may not.

Anyway let's listen to more comments from community.

Thanks

Jin Yao


On 6/28/2017 12:51 PM, Kyle Huey wrote:
On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <yao.jin@xxxxxxxxxxxxxxx> wrote:
Hi,

In theory, the PMI interrupts in skid region should be dropped, right?
No, why would they be dropped?

My understanding of the situation is as follows:

There is some time, call it t_0, where the hardware counter overflows.
The PMU triggers an interrupt, but this is not instantaneous. Call
the time when the interrupt is actually delivered t_1. Then t_1 - t_0
is the "skid".

Note that if the counter is `exclude_kernel`, then at t_0 the CPU
*must* be running a userspace program. But by t_1, the CPU may be
doing something else. Your patch changed things so that if at t_1 the
CPU is in the kernel, then the interrupt is discarded. But rr has
programmed the counter to deliver a signal on overflow (via F_SETSIG
on the fd returned by perf_event_open). This change results in the
signal never being delivered, because the interrupt was ignored.
(More accurately, the signal is delivered the *next* time the counter
overflows, which is far past where we wanted to inject our
asynchronous event into our tracee.

It seems to me that it might be reasonable to ignore the interrupt if
the purpose of the interrupt is to trigger sampling of the CPUs
register state. But if the interrupt will trigger some other
operation, such as a signal on an fd, then there's no reason to drop
it.

For a userspace debugger, is it the only choice that relies on the *skid*
PMI interrupt?
I don't understand this question, but hopefully the above clarified things.

- Kyle

Thanks
Jin Yao


On 6/28/2017 9:01 AM, Kyle Huey wrote:
Sent again with LKML CCd, sorry for the noise.

- Kyle

On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <me@xxxxxxxxxxxx> wrote:
cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be
a candidate for backporting to stable branches.

rr, a userspace record and replay debugger[0], uses the PMU interrupt
to stop a program during replay to inject asynchronous events such as
signals. We are counting retired conditional branches in userspace
only. This changeset causes the kernel to drop interrupts on the
floor if, during the PMU interrupt's "skid" region, the CPU enters
kernel mode for whatever reason. When replaying traces of complex
programs such as Firefox, we intermittently fail to deliver
asynchronous events on time, leading the replay to diverge from the
recorded state.

It seems like this change should, at a bare minimum, be limited to
counters that actually perform sampling of register state when the
interrupt fires. In our case, with the retired conditional branches
counter restricted to counting userspace events only, it makes no
difference that the PMU interrupt happened to be delivered in the
kernel.

As this makes rr unusable on complex applications and cannot be
efficiently worked around, we would appreciate this being addressed
before 4.12 is finalized, and the regression not being introduced to
stable branches.

Thanks,

- Kyle

[0] http://rr-project.org/