Re: Summary of discussion following LPC2023 sframe talk

From: Peter Zijlstra
Date: Wed Nov 15 2023 - 10:49:45 EST


On Wed, Nov 15, 2023 at 10:09:16AM -0500, Mathieu Desnoyers wrote:
> Hi,
>
> [ With lkml and diamon-discuss in CC ]
>
> I'm adding the following notes of the hallway track discussion we had
> immediately after the sframe slot within the tracing MC [1]. I suspect it
> is relevant (please correct me if I'm wrong or if there are conclusions
> that are too early to tell):
>
> - Handling of shared libraries:
> - the libc dynamic loader should register/unregister sframe sections
> explicitly with new prctl(2) options,
> - The prctl() for registration of the sframe sections can take the
> section address and size as arguments,
> - The prctl for unregistration could take the section address as argument,
> but this would require additional data in the linker map (within libc),
> which is unwanted.
> - One alternative would be to provide additional information to
> sframe registration/unregistration: a key which is decided by the libc
> to match registration/unregistration. That key could be either the
> address of the text section associated with the sframe section, or it
> could be the address of the linker map entry (at the choice of userspace).
> - Overall, the prctl(2) sframe register could have the following parameters:
> { key, sframe address, sframe section length }
> - The prctl(2) sframe unregister would then take a { key } as parameter.
>
> - The kernel backtrace code using the sframe information should consider
> it hostile:
> - can be corrupted by the application (by accident or maliciously),
> - can be corrupted on disk by modification of the ELF binary, either
> before registration or after (either by accident or maliciously),
> - can be malformed to contain loops (need to find a way to have upper
> bounds, sanity checks about the direction of the stack traversal),
>
> - It was discussed that the kernel could possibly validate checksums on
> registration and write-protect the sframe pages. Considering that the
> kernel still needs to consider the content hostile even with those
> mechanisms in place, it is unclear whether they are relevant.
>
> - Mark Rutland told me that for aarch64 the current sframe content is
> not sufficient to express how to walk the stack over code area at
> the beginning of functions before the stack pointer is updated.
> He plans to discuss this with Indu as a follow up.
>
> - Interpreters:
>
> - Walking over an interpreter's own stack can be as simple as skipping
> over the interpreter's runtime functions. This is a first step to
> allow skipping over interpreters without detailed information about
> their own stack layout.

Profiling interpreters is typically done using SIGnals. Perf is capable
of generating signals on overflow. This is slow, but so is an
interpreter. SIGhandler is part of the interpreter and can interpret
the interpreter state and do whatever it damn well pleases.

A stack-machine based interpreter will not have anything but the main
loop on the actual function call stack. Unwinding it using the 'C'
unwinder will yield nothing useful.

> - JITs:
>
> - There are two approaches to skip over JITted code stacks:
>
> - If the jitted code has frame pointers, then use this.
>
> - If we figure out that some JITs do not have frame pointers, then
> we would need to design a new kernel ABI that would allow JITs
> to express sframe-alike information. This will need to be designed
> with the input of JIT communities because some of them are likely
> not psABI compliant (e.g. lua has a separate stack).

Why a new interface? They can use the same prctl() as above. Here text,
there sframe.
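
As a rough illustration of the { key, sframe address, sframe section
length } registration discussed above, here is a minimal userspace model
of the bookkeeping. No such prctl(2) options exist today; every name
below (sframe_register, sframe_unregister, sframe_lookup, the table
size) is hypothetical, and the key is whatever userspace picks, e.g. the
text address for a JIT window.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model only: none of these names exist in the kernel. */
#define SFRAME_MAX_SECTIONS 16

struct sframe_entry {
	uintptr_t key;		/* chosen by userspace (text address, link map, ...) */
	uintptr_t sframe_addr;	/* start of the sframe data */
	size_t sframe_len;	/* length of the sframe data in bytes */
	int used;
};

static struct sframe_entry sframe_table[SFRAME_MAX_SECTIONS];

/* Look up a registered entry by key; NULL if the key is not registered. */
const struct sframe_entry *sframe_lookup(uintptr_t key)
{
	for (size_t i = 0; i < SFRAME_MAX_SECTIONS; i++) {
		if (sframe_table[i].used && sframe_table[i].key == key)
			return &sframe_table[i];
	}
	return NULL;
}

/* Returns 0 on success, -1 on zero length, duplicate key or full table. */
int sframe_register(uintptr_t key, uintptr_t addr, size_t len)
{
	if (len == 0 || sframe_lookup(key))
		return -1;

	for (size_t i = 0; i < SFRAME_MAX_SECTIONS; i++) {
		if (!sframe_table[i].used) {
			sframe_table[i].key = key;
			sframe_table[i].sframe_addr = addr;
			sframe_table[i].sframe_len = len;
			sframe_table[i].used = 1;
			return 0;
		}
	}
	return -1;
}

/* Unregistration needs only the key. Returns 0 on success, -1 if unknown. */
int sframe_unregister(uintptr_t key)
{
	for (size_t i = 0; i < SFRAME_MAX_SECTIONS; i++) {
		if (sframe_table[i].used && sframe_table[i].key == key) {
			sframe_table[i].used = 0;
			return 0;
		}
	}
	return -1;
}
```

In this model a JIT and the dynamic loader go through the same two
calls; only the choice of key differs.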

> - When we have a good understanding of the JIT requirements in terms
> of frame description content, the other element that would need to
> be solved is how to allow JITs to emit frame data in a data structure
> that can expand. We may need something like a reserved memory area, with
> a counter of the number of elements which is used to synchronize communication
> between the JITs (producer) and kernel (consumer).

Again, huh?! Expand? A typical JIT has the normal epoch-like approach to
text generation: have N>1 text windows, JIT into one until full; once
full, copy all still-active crap into the second window, induce a grace
period and wipe the first window; rinse-repeat.

Just have a sframe thing per window and expand the definition of 'full'
to be either text or sframe window is full and everything should just
work, no?
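
A minimal sketch of that expanded 'full' check, assuming each text
window carries a fixed-size sframe area next to it. The struct and
function names are made up for illustration and do not come from any
real JIT; sizes are plain byte counts.

```c
#include <stddef.h>

/* Hypothetical per-window accounting: one text area plus one sframe area. */
struct jit_window {
	size_t text_cap;	/* bytes available for generated text */
	size_t text_used;
	size_t sframe_cap;	/* bytes available for frame descriptions */
	size_t sframe_used;
};

/*
 * The window is 'full' for this function if either the text or the
 * sframe data does not fit. Assumes used <= cap for both areas.
 */
int jit_window_fits(const struct jit_window *w, size_t text_len,
		    size_t sframe_len)
{
	return text_len <= w->text_cap - w->text_used &&
	       sframe_len <= w->sframe_cap - w->sframe_used;
}

/* Account for one emitted function. Returns 0 on success, -1 if full. */
int jit_window_emit(struct jit_window *w, size_t text_len, size_t sframe_len)
{
	if (!jit_window_fits(w, text_len, sframe_len))
		return -1;
	w->text_used += text_len;
	w->sframe_used += sframe_len;
	return 0;
}

/* After the grace period: wipe text and sframe together. */
void jit_window_wipe(struct jit_window *w)
{
	w->text_used = 0;
	w->sframe_used = 0;
}
```

When jit_window_emit() fails, the JIT would switch to the next window
exactly as it does today when it runs out of text space.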

>
> - We would need to figure out if JITs expect to have a single producer per
> frame description area, or multiple producers.

I've not really kept up and only ever seen single threaded JITs, but I
would imagine they each get their own window.

> - We would need to figure out if JITs expect to append frame descriptions in
> sorted function address order (append only for frame description, append only
> for functions text section as well), or if there needs to be support for unsorted
> function entries.

So from what I know they typically do the sorted thing, easier for their
own accounting too.

Note that there is this JAVA JIT text symbol userspace API thing that
tracks symbols. Perf-tool implements that IIRC. Writes it out to a file
which is then read and munged back into the report or so. IIRC this also
includes information on reclaim.

> - We would need information about how JITs reclaim functions, and how it impacts
> the frame description ABI. For instance, we may want to have a tombstone bit to
> state that a frame was deleted.

prctl() unregister + register ? I mean, JIT would need to be fully
co-operative anyway.

> - We may have to create frame description areas whose content is specific to given
> JITs. For instance, the frame descriptions for a lua JIT on x86-64 may not follow
> the x86-64 regular psABI.
>
> - As an initial stage, we can focus on handling the sframe section for executable
> and shared objects, and use frame pointers to skip over JITted code if available.
> The goal here is to show the usefulness of this kind of information so we get
> the interest/collaboration needed to get the relevant input from JIT communities
> as we design the proper ABI for handling JIT frames.

As per: https://realpython.com/python312-perf-profiler/

There is some 'demand' for all this, might be useful to contact some JIT
authors and have them detail their needs or something.