Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation

From: Alexander Shishkin
Date: Fri Sep 23 2016 - 10:35:59 EST


Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:

> On Fri, Sep 23, 2016 at 02:27:21PM +0300, Alexander Shishkin wrote:
>> In order to be able to allocate perf ring buffers in non-mmap path, we
>> need to make sure we can still account the memory to the user and that
>> they don't exceed their mlock limit.
>>
>> This patch moves ring buffer memory accounting down the rb_alloc() path
>> so that its callers won't have to worry about it. This also serves the
>> additional purpose of slightly cleaning up perf_mmap().
>
> While I like a cleanup of that code (it really can use one), I'm not a
> big fan of hidden buffers like this. Why is this needed?

So what I wanted is a similar interface to call stack sampling or pretty
much anything else sampling that we have at the moment. The user would
ask for AUX samples of, say, intel_pt, and would get a sample with PT
stuff right in the perf buffer every time their main event overflows.

They don't *need* to know that we have a kernel event with a ring buffer
under the hood. This was one of the use cases of 'hidden' ring
buffers. The other two are process core dump and system core dump ([1]
tried to do it without involving perf at all, for reference).

> A quick look through the patches also leaves me wondering on the design
> and interface of this thing. A few words explaining the overall design
> would be nice.

Right; here goes. PERF_SAMPLE_AUX is set in the attr.sample_type of the
event that you want to sample. Then, using that event's
attr.aux_sample_type as the PMU 'type' and attr.aux_sample_config as
'config' we create a kernel event. For this kernel event, we then
allocate a ring buffer with 0 data pages and as many aux pages as would
fit the attr.aux_sample_size.
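
As a rough illustration of the sizing step, here is a minimal userspace sketch. The struct is only a stand-in for the attr fields named above, not the real perf_event_attr layout. The 4096-byte page size and the round-up behaviour are assumptions; the actual patch may round differently or require a power-of-two page count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the attr fields described in this mail. */
struct aux_sample_attr {
	unsigned int       aux_sample_type;   /* PMU 'type' for the kernel event */
	unsigned long long aux_sample_config; /* PMU 'config' for the kernel event */
	unsigned long long aux_sample_size;   /* bytes of AUX data per sample */
};

/* Assumed page size for this sketch; the kernel would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Whole pages needed to hold aux_sample_size bytes (rounded up, by assumption). */
static unsigned long long aux_pages_for(const struct aux_sample_attr *attr)
{
	assert(attr != NULL);
	return (attr->aux_sample_size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

With these assumptions, a request for 4097 bytes would need two AUX pages and zero data pages.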

Then, we hook into the perf_prepare_sample()/perf_output_sample() path
so that when the original event goes off, we first stop the kernel
event, then memcpy the data from the 'hidden' aux buffer into the
original event's perf buffer under PERF_SAMPLE_AUX and then restart the
kernel event. All of this happens on the local cpu. The 'hidden' aux
buffer is running in overwrite mode, so we copy attr.aux_sample_size
bytes every time, which means there may be overlaps between samples, but
the tooling has logic to handle this.
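
The overlap between samples can be seen in a toy model of an overwrite-mode buffer. This is a userspace sketch only: struct ow_buf, ow_write() and ow_snapshot() are made-up names, and the real code copies from AUX pages with the kernel event stopped rather than from a flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy overwrite-mode buffer: writes wrap and clobber the oldest data. */
struct ow_buf {
	unsigned char *data;
	size_t size; /* capacity in bytes */
	size_t head; /* total bytes ever written (monotonic) */
};

static void ow_write(struct ow_buf *b, const void *src, size_t len)
{
	const unsigned char *p = src;

	assert(b != NULL && b->size > 0);
	for (size_t i = 0; i < len; i++)
		b->data[(b->head + i) % b->size] = p[i];
	b->head += len;
}

/*
 * Copy the newest 'want' bytes, oldest first, into dst.
 * Returns the number of bytes copied, which is smaller than 'want'
 * only when fewer bytes are available in the buffer.
 */
static size_t ow_snapshot(const struct ow_buf *b, void *dst, size_t want)
{
	assert(b != NULL && b->size > 0);

	size_t avail = b->head < b->size ? b->head : b->size;
	size_t len = want < avail ? want : avail;
	size_t start = (b->head - len) % b->size;
	size_t first = b->size - start; /* bytes before the wrap point */

	if (first > len)
		first = len;
	memcpy(dst, b->data + start, first);
	memcpy((unsigned char *)dst + first, b->data, len - first);
	return len;
}
```

If the main event fires twice while fewer than 'want' new bytes have been written in between, the second snapshot repeats the tail of the first, which is the overlap the tooling has to deal with.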

This is about it. Before creating a new counter we first look for an
existing one that fits the bill wrt filtering bits; if there is one, we
grab its reference and use it instead. This is so that one could do
things like

$ perf record -Aintel_pt -e 'cycles,instructions,branch-misses' ls

or

$ perf record -Aintel_pt -e 'sched:*' -a sleep 10
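
The lookup-then-reuse step could be modelled roughly as below. This is a sketch under assumptions: struct aux_event, its 'filters' field and aux_event_get() are hypothetical names for illustration, and the real matching criteria are whatever filtering bits the patch compares.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for a kernel-side AUX event; fields are illustrative. */
struct aux_event {
	unsigned int       type;     /* PMU type */
	unsigned long long config;   /* PMU config */
	unsigned int       filters;  /* stand-in for the "filtering bits" */
	int                refcount; /* 0 means the slot is free */
};

/*
 * Return an event matching (type, config, filters), taking a reference.
 * If none exists, claim a free slot; return NULL if the table is full.
 */
static struct aux_event *aux_event_get(struct aux_event *tbl, size_t n,
				       unsigned int type,
				       unsigned long long config,
				       unsigned int filters)
{
	struct aux_event *free_slot = NULL;

	assert(tbl != NULL);
	for (size_t i = 0; i < n; i++) {
		struct aux_event *ev = &tbl[i];

		if (ev->refcount == 0) {
			if (!free_slot)
				free_slot = ev; /* remember first free slot */
			continue;
		}
		if (ev->type == type && ev->config == config &&
		    ev->filters == filters) {
			ev->refcount++; /* reuse: grab a reference */
			return ev;
		}
	}
	if (free_slot) {
		free_slot->type = type;
		free_slot->config = config;
		free_slot->filters = filters;
		free_slot->refcount = 1; /* newly created event */
	}
	return free_slot;
}
```

In the first command above, the three sampled events would then end up holding three references to a single intel_pt event instead of creating three.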

> Afaict there's no actual need to hide the AUX buffer for this sampling
> stuff; the user knows about all this and can simply mmap() the AUX part.

Yes, you're right here. We could also re-use the AUX record, adding a
new flag for this. It may even be better, if I can work out the
inheritance (the current code doesn't handle inheritance yet, in case
we decide to scrap it).

> The sample could either point to locations in the AUX buffer, or (as I
> think this code does) memcpy bits out.

Yes and yes, it does.

> Ideally we'd pass the AUX-event into the syscall, that way you avoid all
> the find_aux_event crud. I'm not sure we want to overload the group_fd
> thing more (its already very hard to create counter groups in a cgroup
> for example) ..

It can also be stuffed into the attribute or ioctl()ed. The latter is
probably the best.

> Coredump was mentioned somewhere, but I'm not sure I've seen
> code/interfaces for that. How was that envisioned to work?

Ok, so what I have is a new RLIMIT_PERF, which is set to the size of
the aux data sample to be included in the [process] core dump. At the
prlimit(RLIMIT_PERF) time, given that RLIMIT_CORE is also nonzero, I
create a kernel event with a 'hidden' buffer. The PMU for this event is,
in this scenario, a system-wide setting, which is a tad iffy, seeing as
we now have 2 PMUs in the system that can be used for this, but which
are mutually exclusive.

Now, when the core dump is written, we check if there's such an event on
the task's perf context and if there is, we dump_emit() data from the
hidden buffer into the file. The difference with sampling is that this
kernel event is also inheritable, so that when the task fork()s, a new
event is created. The memory is counted against
sysctl_perf_event_mlock+user's RLIMIT_MEMLOCK (just like the rest of
perf buffers), so when the user is out of it, no new events are created.
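
The two-tier accounting can be modelled as follows. This is a simplified userspace sketch, not the kernel code: struct mlock_budget and mlock_charge() are invented names, all values are in pages, and details such as per-cpu scaling of the sysctl and privileged overrides are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of the two budgets a perf buffer is charged against. */
struct mlock_budget {
	unsigned long perf_limit;    /* allowance derived from sysctl_perf_event_mlock */
	unsigned long perf_used;     /* pages already charged to that allowance */
	unsigned long memlock_limit; /* user's RLIMIT_MEMLOCK, in pages */
	unsigned long memlock_used;  /* pages already charged to the rlimit */
};

/*
 * Charge 'pages' to the perf allowance first and spill the remainder
 * into RLIMIT_MEMLOCK. Returns 0 on success; returns -1 and leaves the
 * budget untouched if the remainder does not fit.
 */
static int mlock_charge(struct mlock_budget *b, unsigned long pages)
{
	assert(b != NULL);

	unsigned long perf_free = b->perf_limit > b->perf_used ?
				  b->perf_limit - b->perf_used : 0;
	unsigned long from_perf = pages < perf_free ? pages : perf_free;
	unsigned long extra = pages - from_perf;
	unsigned long memlock_free = b->memlock_limit > b->memlock_used ?
				     b->memlock_limit - b->memlock_used : 0;

	if (extra > memlock_free)
		return -1; /* out of budget: no new event gets created */

	b->perf_used += from_perf;
	b->memlock_used += extra;
	return 0;
}
```

In this model, a fork() that would need a new inherited event simply fails the charge once both budgets are exhausted, matching the "no new events are created" behaviour described above.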

The rlimit as an interface to enable this seems weirder the more I look
at it, which is also the reason why I haven't sent it out yet. The other
idea I had for this was a prctl(), which would be more straightforward
and would also allow specifying the PMU, but, unlike prlimit(), would
only work on the current process. Yet another way would be to go through
perf_event_open() and then somehow feed the event into the ether instead
of polling it.

The last one that can use the hidden buffer is system core dumps, which
would be either retrieved by kdump or stored in pstore/EFI capsule. I
don't have the code for this yet, but the general idea is that per-cpu
AUX events would start at boot time in overwrite mode and just hang in
there till things go south.

[1] http://marc.info/?l=linux-kernel&m=143814616805933

Thanks,
--
Alex