Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints

From: Marco Elver
Date: Fri Jun 10 2022 - 05:38:04 EST


On Fri, 10 Jun 2022 at 11:04, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
> On Thu, 9 Jun 2022 at 20:37, Marco Elver <elver@xxxxxxxxxx> wrote:
> > > /On Thu, 9 Jun 2022 at 16:56, Marco Elver <elver@xxxxxxxxxx> wrote:
> > > > > > On a machine with 256 CPUs, running the recently added perf breakpoint
> > > > > > benchmark results in:
> > > > > >
> > > > > > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > > > > > | # Running 'breakpoint/thread' benchmark:
> > > > > > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > > > > > | Total time: 236.418 [sec]
> > > > > > |
> > > > > > | 123134.794271 usecs/op
> > > > > > | 7880626.833333 usecs/op/cpu
> > > > > >
> > > > > > The benchmark tests inherited breakpoint perf events across many
> > > > > > threads.
> > > > > >
> > > > > > Looking at a perf profile, we can see that the majority of the time is
> > > > > > spent in various hw_breakpoint.c functions, which execute within the
> > > > > > 'nr_bp_mutex' critical sections which then results in contention on that
> > > > > > mutex as well:
> > > > > >
> > > > > > 37.27% [kernel] [k] osq_lock
> > > > > > 34.92% [kernel] [k] mutex_spin_on_owner
> > > > > > 12.15% [kernel] [k] toggle_bp_slot
> > > > > > 11.90% [kernel] [k] __reserve_bp_slot
> > > > > >
> > > > > > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > > > > > O(#tasks) due to storing all task breakpoints in the same list and
> > > > > > iterating through that list looking for a matching task. Clearly, this
> > > > > > does not scale to thousands of tasks.
> > > > > >
> > > > > > While one option would be to make task_struct a breakpoint list node,
> > > > > > this would only further bloat task_struct for infrequently used data.
> > > > >
> > > > > task_struct already has:
> > > > >
> > > > > #ifdef CONFIG_PERF_EVENTS
> > > > > struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> > > > > struct mutex perf_event_mutex;
> > > > > struct list_head perf_event_list;
> > > > > #endif
> > > > >
> > > > > Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
> > > > > And possibly perf_event_list instead of task_bps_ht? It will contain
> > > > > other perf_event types, so we will need to test type as well, but on
> > > > > the positive side, we don't need any management of the separate
> > > > > container.
> > > >
> > > > Hmm, yes, I looked at that but then decided against messing the
> > > > perf/core internals. The main issue I have with using perf_event_mutex
> > > > is that we might interfere with perf/core's locking rules as well as
> > > > interfere with other concurrent perf event additions. Using
> > > > perf_event_list is very likely a no-go because it requires reworking
> > > > perf/core as well.
> > > >
> > > > I can already hear Peter shouting, but maybe I'm wrong. :-)
> > >
> > > Let's wait for Peter to shout then :)
> > > A significant part of this change is having per-task data w/o having
> > > per-task data.
> > >
> > > The current perf-related data in task_struct is already multiple words
> > > and it's also not used in lots of production cases.
> > > Maybe we could have something like:
> > >
> > > struct perf_task_data* lazily_allocated_perf_data;
> > >
> > > that's lazily allocated on first use instead of the current
> > > perf_event_ctxp/perf_event_mutex/perf_event_list.
> > > This way we could both reduce task_size when perf is not used and have
> > > more perf-related data (incl breakpoints) when it's used.
> >
> > I don't mind either option, so keeping task_struct bloat in mind, we have:
> >
> > 1. rhashtable option, no changes to task_struct.
> >
> > 2. add the breakpoint mutex + list to task_struct.
> >
> > 3. add something like hw_breakpoint_task_data* and allocate lazily.
> >
> > 4. (your proposal) move all of perf data into a new struct (+add
> > hw_breakpoint things in there) that is lazily allocated.
> >
> > I don't think perf is that infrequently used, and I can't estimate
> > performance impact, so I don't like #4 too much personally. My
> > preferred compromise would be #3, but at the same time I'd rather not
> > bloat task_struct even with 8 extra infrequently used bytes. Am I too
> > paranoid?
> >
> > Preferences?
>
>
> There is also this "could eventually get its own" comment:
>
> static struct pmu perf_breakpoint = {
> .task_ctx_nr = perf_sw_context, /* could eventually get its own */
> https://elixir.bootlin.com/linux/v5.19-rc1/source/kernel/events/hw_breakpoint.c#L669
>
> If it gets its own, then it also gets a perf_event_context pointer in
> task_struct:
> https://elixir.bootlin.com/linux/v5.19-rc1/source/include/linux/sched.h#L1229
> And perf_event_context has its own mutex and lots of other stuff.
> But I don't know what other implications it has.

Relying on perf events to be the only way that instantiates
breakpoints does not work, because hw_breakpoint is also used by
ptrace independently.

On a whole, adding lazily allocated data to task_struct is not as
simple as the rhashtable option (need to take care of fork and exit
and make sure the lazily allocated data lives long enough etc.). I
question the added complexity vs. the benefit, when using the
rhashtable avoids all that. If I get rid of the O(#cpu) loops it also
doesn't show up in profiles anymore and any efforts to optimize here
are not buying us much in terms of performance.

If the main issue is the mutex, I suppose we can find a hole in
task_struct and stick it there (there's a massive 32-byte hole above
task_struct::stats).

Was the mutex the only benefit?