Re: [PATCH RFC 0/4] Add support for synchronous signals on perf events

From: Marco Elver
Date: Tue Feb 23 2021 - 17:30:16 EST


On Tue, 23 Feb 2021 at 21:27, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> > On Feb 23, 2021, at 6:34 AM, Marco Elver <elver@xxxxxxxxxx> wrote:
> >
> > The perf subsystem today unifies various tracing and monitoring
> > features, from both software and hardware. One benefit of the perf
> > subsystem is automatically inheriting events to child tasks, which
> > enables process-wide event monitoring with low overhead. By default
> > perf events are non-intrusive, not affecting behaviour of the tasks
> > being monitored.
> >
> > For certain use-cases, however, it makes sense to leverage the
> > generality of the perf events subsystem and optionally allow the tasks
> > being monitored to receive signals on events they are interested in.
> > This patch series adds the option to synchronously signal user space on
> > events.
>
> Unless I missed some machinations, which is entirely possible, you can’t call force_sig_info() from NMI context. Not only am I not convinced that the core signal code is NMI safe, but at least x86 can’t correctly deliver signals on NMI return. You probably need an IPI-to-self.

force_sig_info() is called from an irq_work only: perf_pending_event
-> perf_pending_event_disable -> perf_sigtrap -> force_sig_info. What
did I miss?

> > The discussion at [1] led to the changes proposed in this series. The
> > approach taken in patch 3/4 to use 'event_limit' to trigger the signal
> > was kindly suggested by Peter Zijlstra in [2].
> >
> > [1] https://lore.kernel.org/lkml/CACT4Y+YPrXGw+AtESxAgPyZ84TYkNZdP0xpocX2jwVAbZD=-XQ@xxxxxxxxxxxxxx/
> > [2] https://lore.kernel.org/lkml/YBv3rAT566k+6zjg@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> >
> > Motivation and example uses:
> >
> > 1. Our immediate motivation is low-overhead sampling-based race
> > detection for user-space [3]. By using perf_event_open() at
> > process initialization, we can create hardware
> > breakpoint/watchpoint events that are propagated automatically
> > to all threads in a process. As far as we are aware, today no
> > existing kernel facility (such as ptrace) allows us to set up
> > process-wide watchpoints with minimal overheads (that are
> > comparable to mprotect() of whole pages).
>
> This would be doable much more simply with an API to set a breakpoint. All the machinery exists except the actual user API.

Isn't perf_event_open() that API?

A new user API implementation will either be a thin wrapper around
perf events or reinvent half of perf events to deal with managing
watchpoints across a set of tasks (process-wide or some subset).
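
As a rough illustration of what already exists today, a minimal sketch
of setting up an inherited hardware watchpoint through
perf_event_open(). The helper names watchpoint_attr_init() and
watchpoint_open() are made up for this example, and nothing here uses
the signal delivery proposed in this series; it only opens the event.

```c
#define _GNU_SOURCE
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: describe a 4-byte write watchpoint on addr.
 * addr is assumed to be 4-byte aligned.
 */
void watchpoint_attr_init(struct perf_event_attr *attr, const void *addr)
{
	memset(attr, 0, sizeof(*attr));
	attr->type = PERF_TYPE_BREAKPOINT;
	attr->size = sizeof(*attr);
	attr->bp_type = HW_BREAKPOINT_W;     /* trigger on writes */
	attr->bp_addr = (uintptr_t)addr;
	attr->bp_len = HW_BREAKPOINT_LEN_4;
	attr->sample_period = 1;             /* every hit counts as an overflow */
	attr->inherit = 1;                   /* child tasks created later inherit it */
	attr->exclude_kernel = 1;
	attr->exclude_hv = 1;
}

/*
 * Hypothetical helper: open the event for the calling task on any CPU.
 * glibc provides no wrapper, so the raw syscall is used.
 * Returns a file descriptor, or -1 with errno set.
 */
int watchpoint_open(const void *addr)
{
	struct perf_event_attr attr;

	watchpoint_attr_init(&attr, addr);
	return (int)syscall(SYS_perf_event_open, &attr, 0 /* pid: self */,
			    -1 /* cpu: any */, -1 /* group_fd */,
			    PERF_FLAG_FD_CLOEXEC);
}
```

Because inheritance only covers tasks created after the event is
opened, this would have to run at process initialization, before any
threads are spawned. The call can fail depending on
perf_event_paranoid and on available debug registers, so callers
should check for -1.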

It's not just breakpoints though.

> > [3] https://llvm.org/devmtg/2020-09/slides/Morehouse-GWP-Tsan.pdf
> >
> > 2. Other low-overhead error detectors that rely on detecting
> > accesses to certain memory locations or code, either process-wide
> > or only in a specific set of subtasks or threads.
> >
> > Other example use-cases we found potentially interesting:
> >
> > 3. Code hot patching without full stop-the-world. Specifically: set
> > a code breakpoint at the entry to the patched routine, then
> > send signals to threads and check that they are not in the
> > routine, but without stopping them further. If any of the
> > threads then enters the routine, it will receive SIGTRAP and
> > pause.
>
> Cute.
>
> >
> > 4. Safepoints without mprotect(). Some Java implementations use
> > "load from a known memory location" as a safepoint. When threads
> > need to be stopped, the page containing the location is
> > mprotect()ed and threads get a signal. This can be replaced with
> > a watchpoint, which does not require a whole page nor DTLB
> > shootdowns.
>
> I’m skeptical. Propagating a hardware breakpoint to all threads involves IPIs and horribly slow writes to DR0 (or 1, 2, or 3) and DR7. A TLB flush can be accelerated using paravirt or hypothetical future hardware. Or real live hardware on ARM64.
>
> (The hypothetical future hardware is almost present on Zen 3. A bit of work is needed on the hardware end to make it useful.)

Fair enough. Watchpoints can be much more fine-grained than an
mprotect(), though, and mprotect() has downsides of its own (such as
having to check whether the accessed memory was actually the bytes
we're interested in). Maybe we should also ask CPU vendors to give us
better watchpoints (perhaps starting with more of them, and making
them easier to set in batch)? We still need a user space API...
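
To make the granularity point concrete, a small sketch of the check an
mprotect()-based scheme needs and a watchpoint would not. Both helper
names are invented for illustration; in a real SIGSEGV handler the
fault address would come from siginfo_t's si_addr.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: start of the page containing addr. This is the
 * finest granularity mprotect() can protect.
 */
uintptr_t page_base(uintptr_t addr)
{
	uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);

	return addr & ~(page_size - 1);
}

/*
 * Hypothetical helper: true if fault_addr lies in [target, target + len).
 * Any other fault on the protected page is a false positive that the
 * handler has to filter out and resume from.
 */
bool fault_hits_target(uintptr_t fault_addr, uintptr_t target, size_t len)
{
	return fault_addr >= target && fault_addr - target < len;
}
```

With a watchpoint, the hardware does this range comparison itself, so
unrelated accesses to the same page never trap.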

Thanks,
-- Marco



> >
> > 5. Tracking data flow globally.
> >
> > 6. Threads receiving signals on performance events to
> > throttle/unthrottle themselves.