Re: [PATCH v5 00/50] Improvements to memory use

From: Namhyung Kim
Date: Wed Nov 29 2023 - 20:16:34 EST


On Mon, Nov 27, 2023 at 2:09 PM Ian Rogers <irogers@xxxxxxxxxx> wrote:
>
> Fix memory leaks detected by address/leak sanitizer affecting LBR
> call-graphs, perf mem and BPF offcpu.
>
> Make branch_type_stat in callchain_list optional as it is large and
> not always necessary - in particular it isn't used by perf top.
>
> Make the allocations of zstd streams, kernel symbols and event copies
> lazier in order to save memory in cases like perf record.
>
> Handle the thread exit event and have it remove the thread from the
> threads set in machine. Don't do this for perf report as it causes a
> regression for task lists, which assume threads are never removed from
> the machine's set, and for offcpu events, which may synthesize samples for
> threads that have exited.
>
> Avoid using 8kb buffers for filename__read_str, which is excessive for
> reading CPU maps. Add io_dir as an allocation-free readdir
> replacement, since opendir allocates 32kb by default and the code uses
> it recursively.
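To illustrate what an allocation-free directory walk can look like on Linux, here is a minimal sketch that calls the getdents64 system call directly into a small buffer embedded in the iterator. It is an assumption about the general technique, not the io_dir.h API from the series: `struct dir_iter`, `dir_iter__init`, `dir_iter__next` and the 512 byte buffer size are invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Layout of the records written by the Linux getdents64 system call. */
struct dirent64_rec {
	uint64_t       d_ino;
	int64_t        d_off;
	unsigned short d_reclen;	/* total size of this record */
	unsigned char  d_type;
	char           d_name[];	/* NUL terminated file name */
};

/* Hypothetical iterator: all state lives in the caller's struct, no malloc. */
struct dir_iter {
	int     fd;		/* directory file descriptor, owned by the caller */
	ssize_t avail;		/* bytes of valid data in buf */
	ssize_t pos;		/* offset of the next record in buf */
	_Alignas(8) char buf[512];	/* room for at least one maximal record */
};

static void dir_iter__init(struct dir_iter *it, int fd)
{
	it->fd = fd;
	it->avail = 0;
	it->pos = 0;
}

/* Return the next entry, or NULL at end of directory or on error. */
static struct dirent64_rec *dir_iter__next(struct dir_iter *it)
{
	struct dirent64_rec *rec;

	if (it->pos >= it->avail) {
		/* Buffer consumed: refill it straight from the kernel. */
		long n = syscall(SYS_getdents64, it->fd, it->buf, sizeof(it->buf));

		if (n <= 0)
			return NULL;
		it->avail = n;
		it->pos = 0;
	}
	rec = (struct dirent64_rec *)(it->buf + it->pos);
	it->pos += rec->d_reclen;
	return rec;
}
```

Since the iterator lives on the caller's stack, a recursive walk pays a few hundred bytes of stack per level rather than a heap allocation per opendir call.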
>
> Shrink perf map by using a two-value byte to replace two function
> pointers. Modify the implementation of maps to use a sorted array
> rather than an rbtree as the container for maps. Improve locking
> and reference counting.
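As a rough picture of the "lazily sorted array" approach, the sketch below appends entries without sorting and only sorts when a lookup finds the array out of order. It is not the maps.c code: `struct range`, `struct range_array` and the function names are hypothetical, and the ranges are assumed not to overlap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical address range standing in for a map: [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

struct range_array {
	struct range *entries;
	size_t nr;
	size_t allocated;
	bool sorted;	/* false once an append breaks the address order */
};

/* qsort comparator: order ranges by start address. */
static int range__cmp(const void *a, const void *b)
{
	const struct range *ra = a, *rb = b;

	if (ra->start < rb->start)
		return -1;
	return ra->start > rb->start;
}

/* bsearch comparator: the key is an address, the element is a range. */
static int range__cmp_addr(const void *key, const void *elem)
{
	uint64_t addr = *(const uint64_t *)key;
	const struct range *r = elem;

	if (addr < r->start)
		return -1;
	return addr >= r->end;
}

/* Append a range; sorting is deferred until the next lookup. */
static int range_array__insert(struct range_array *ra, uint64_t start, uint64_t end)
{
	if (ra->nr == ra->allocated) {
		size_t n = ra->allocated ? ra->allocated * 2 : 8;
		struct range *tmp = realloc(ra->entries, n * sizeof(*tmp));

		if (!tmp)
			return -1;
		ra->entries = tmp;
		ra->allocated = n;
	}
	if (ra->nr == 0)
		ra->sorted = true;
	else if (ra->entries[ra->nr - 1].start > start)
		ra->sorted = false;
	ra->entries[ra->nr].start = start;
	ra->entries[ra->nr].end = end;
	ra->nr++;
	return 0;
}

/* Find the range containing addr, sorting first if needed. */
static struct range *range_array__find(struct range_array *ra, uint64_t addr)
{
	if (ra->nr == 0)
		return NULL;
	if (!ra->sorted) {
		qsort(ra->entries, ra->nr, sizeof(*ra->entries), range__cmp);
		ra->sorted = true;
	}
	return bsearch(&addr, ra->entries, ra->nr, sizeof(*ra->entries),
		       range__cmp_addr);
}
```

Compared with an rbtree, each entry carries no node pointers or color bits, and entries that happen to arrive in address order never trigger a sort at all. The returned pointer is only valid until the next insert, which is one reason real code would hand out reference counted objects instead.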
>
> Similar to maps, separate out and reimplement threads to use a hashmap
> for lower memory consumption and faster lookup. This fixes a
> regression in memory usage where reference count checking switched to
> using non-invasive tree nodes. Reduce its default size by 32 times
> and improve locking discipline. Also, fix regressions where tids had
> become unordered to make `perf report --tasks` and
> `perf trace --summary` output easier to read.
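The tid-keyed table can be pictured with the sketch below, which uses a plain chained hash table with 8 buckets, including removal on thread exit. It swaps in a simple hand-rolled table for illustration; it is not the hashmap implementation perf uses, and `struct thread_entry`, `struct thread_table` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

#define TABLE_SIZE 8	/* small bucket count, chosen for the sketch */

/* Hypothetical per-thread record, chained within its bucket. */
struct thread_entry {
	pid_t tid;
	struct thread_entry *next;
};

struct thread_table {
	struct thread_entry *buckets[TABLE_SIZE];
};

static unsigned int tid_hash(pid_t tid)
{
	return (unsigned int)tid % TABLE_SIZE;
}

/* Look up a tid; returns NULL if it is not in the table. */
static struct thread_entry *thread_table__find(struct thread_table *t, pid_t tid)
{
	struct thread_entry *e;

	for (e = t->buckets[tid_hash(tid)]; e; e = e->next) {
		if (e->tid == tid)
			return e;
	}
	return NULL;
}

/* Look up a tid, creating the entry if it does not exist yet. */
static struct thread_entry *thread_table__findnew(struct thread_table *t, pid_t tid)
{
	struct thread_entry *e = thread_table__find(t, tid);
	unsigned int h = tid_hash(tid);

	if (e)
		return e;
	e = calloc(1, sizeof(*e));
	if (!e)
		return NULL;
	e->tid = tid;
	e->next = t->buckets[h];
	t->buckets[h] = e;
	return e;
}

/* Drop a tid, e.g. on a thread exit event. Returns 1 if it was present. */
static int thread_table__remove(struct thread_table *t, pid_t tid)
{
	struct thread_entry **pp = &t->buckets[tid_hash(tid)];

	while (*pp) {
		struct thread_entry *e = *pp;

		if (e->tid == tid) {
			*pp = e->next;
			free(e);
			return 1;
		}
		pp = &e->next;
	}
	return 0;
}
```

Iterating such a table visits tids in bucket order rather than numeric order, which is why output that wants sorted tids has to sort explicitly.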
>
> Better encapsulate the dsos abstraction. Replace the linked list and
> rbtree, used for faster iteration and log(n) lookup, with a sorted array
> for similar performance but half the memory usage per dso. Improve
> reference counting and locking discipline, adding reference count
> checking to dso. Experimented with, but abandoned, a hashmap
> implementation due to the need for extra storage and the keys not
> being stable.
>
> The overall effect is to reduce memory consumption significantly for
> perf top - with call graphs enabled, it runs longer before 1GB of
> memory is consumed. For a perf record of 'true', the memory
> consumption goes from 39912kb max resident to 20096kb max resident -
> nearly halved. perf inject with -b of a system wide perf record of
> 'true' reduces the max resident by roughly 4.5% (3.4% in v4 due to
> branch_type_stat changes being merged). This is while improving
> correctness with locking discipline and reference count checking.
>
> Patch organization (v5):
> - 50 patches is a lot; the patches aren't divided because they would merge
> conflict, and later patches, for example in dsos, rely on the changes and
> fixes to maps.

You don't need to do it all at once. AFAIK the io_dir changes are independent
and you can separate the map/maps changes from the others. Maybe you can wait
for the map changes to be merged before working on the dso changes. I know
it'd take more time, but smaller patches that each focus on a single thing
would be easier to deal with, both for you and for the reviewers.

p.s. I know I also have a set of ~50 patches and feel sorry about saying
this. ;-p Maybe I need to split the data type profiling series too.

Thanks,
Namhyung


> - the dso reference count checking patch is larger because it switches uses
> of dso to go through accessors, to encapsulate the reference count checker
> macros. The reference count checking changes within this largely mechanical
> change amount to a few lines and so weren't separated.
> - the first patch contains a build fix, missed from v3, for when rwsem error
> checking is enabled.
> - the next patches are an assortment of memory size fixes.
> - the next patches are the refactoring of maps.
> - the next patches are the refactoring of threads.
> - the next patches are the refactoring of dsos.
> - finally reference count checking is added to dso and some lock/reference
> count issues are resolved. This is done after changing the data structures,
> for example, as it is easier to add reference count checking to the single
> pointer on an array than to the 5 previous pointers.
>
> v5: 3 patches were merged. 2nd patch addressed feedback from
> namhyung@xxxxxxxxxx and Guilherme Amadio <amadio@xxxxxxxxxx>. 4th
> patch renames the function to getdelim as suggested by
> namhyung@xxxxxxxxxx. 5th patch adds the missing sysfs mountpoint as
> suggested by namhyung@xxxxxxxxxx. 49th patch fixes a missed put in
> the dso_data tests.
> v4: Rebased as 11 changes moved to perf-tools-next. Address comments
> from v3 such as error checking on zstd streams. Improve the
> dsos/dso in ways similar to threads and maps, with the addition of
> reference count checking on dso.
> v3: Additional memory/speed improvements, in particular for maps and
> threads. Address review comments from namhyung@xxxxxxxxxx and
> adrian.hunter@xxxxxxxxx.
> v2: Add additional memory fixes on top of initial LBR and rc check
> fixes.
>
> Ian Rogers (50):
> perf comm: Use regular mutex
> libperf: Lazily allocate/size mmap event copy
> perf mmap: Lazily initialize zstd streams
> tools api fs: Switch filename__read_str to use io.h
> tools api fs: Avoid reading whole file for a 1 byte bool
> tools lib api: Add io_dir an allocation free readdir alternative
> perf maps: Switch modules tree walk to io_dir__readdir
> perf record: Be lazier in allocating lost samples buffer
> perf pmu: Switch to io_dir__readdir
> perf header: Switch mem topology to io_dir__readdir
> perf events: Remove scandir in thread synthesis
> perf map: Simplify map_ip/unmap_ip and make map size smaller
> perf maps: Move symbol maps functions to maps.c
> perf thread: Add missing RC_CHK_EQUAL
> perf maps: Add maps__for_each_map to call a function on each entry
> perf maps: Add remove maps function to remove a map based on callback
> perf debug: Expose debug file
> perf maps: Refactor maps__fixup_overlappings
> perf maps: Do simple merge if given map doesn't overlap
> perf maps: Rename clone to copy from
> perf maps: Add maps__load_first
> perf maps: Add find next entry to give entry after the given map
> perf maps: Reduce scope of map_rb_node and maps internals
> perf maps: Fix up overlaps during fixup_end
> perf maps: Switch from rbtree to lazily sorted array for addresses
> perf maps: Get map before returning in maps__find
> perf maps: Get map before returning in maps__find_by_name
> perf maps: Get map before returning in maps__find_next_entry
> perf maps: Hide maps internals
> perf maps: Locking tidy up of nr_maps
> perf dso: Reorder variables to save space in struct dso
> perf report: Sort child tasks by tid
> perf trace: Ignore thread hashing in summary
> perf machine: Move fprintf to for_each loop and a callback
> perf threads: Move threads to its own files
> perf threads: Switch from rbtree to hashmap
> perf threads: Reduce table size from 256 to 8
> perf dsos: Attempt to better abstract dsos internals
> perf dsos: Tidy reference counting and locking
> perf dsos: Add dsos__for_each_dso
> perf dso: Move dso functions out of dsos
> perf dsos: Switch more loops to dsos__for_each_dso
> perf dsos: Switch backing storage to array from rbtree/list
> perf dsos: Remove __dsos__addnew
> perf dsos: Remove __dsos__findnew_link_by_longname_id
> perf dsos: Switch hand code to bsearch
> perf dso: Add reference count checking and accessor functions
> perf dso: Reference counting related fixes
> perf dso: Use container_of to avoid a pointer in dso_data
> perf env: Avoid recursively taking env->bpf_progs.lock
>
> tools/lib/api/Makefile | 2 +-
> tools/lib/api/fs/fs.c | 80 +-
> tools/lib/api/io.h | 11 +-
> tools/lib/api/io_dir.h | 75 +
> tools/lib/perf/include/internal/mmap.h | 3 +-
> tools/lib/perf/mmap.c | 21 +-
> tools/perf/arch/x86/tests/dwarf-unwind.c | 1 +
> tools/perf/arch/x86/util/event.c | 103 +-
> tools/perf/builtin-annotate.c | 6 +-
> tools/perf/builtin-buildid-cache.c | 2 +-
> tools/perf/builtin-buildid-list.c | 18 +-
> tools/perf/builtin-inject.c | 96 +-
> tools/perf/builtin-kallsyms.c | 2 +-
> tools/perf/builtin-mem.c | 4 +-
> tools/perf/builtin-record.c | 57 +-
> tools/perf/builtin-report.c | 243 ++--
> tools/perf/builtin-script.c | 8 +-
> tools/perf/builtin-top.c | 4 +-
> tools/perf/builtin-trace.c | 41 +-
> tools/perf/tests/code-reading.c | 8 +-
> tools/perf/tests/dso-data.c | 67 +-
> tools/perf/tests/hists_common.c | 6 +-
> tools/perf/tests/hists_cumulate.c | 4 +-
> tools/perf/tests/hists_output.c | 2 +-
> tools/perf/tests/maps.c | 64 +-
> tools/perf/tests/symbols.c | 2 +-
> tools/perf/tests/thread-maps-share.c | 8 +-
> tools/perf/tests/vmlinux-kallsyms.c | 181 +--
> tools/perf/ui/browsers/annotate.c | 6 +-
> tools/perf/ui/browsers/hists.c | 8 +-
> tools/perf/ui/browsers/map.c | 4 +-
> tools/perf/util/Build | 1 +
> tools/perf/util/annotate.c | 44 +-
> tools/perf/util/auxtrace.c | 2 +-
> tools/perf/util/block-info.c | 2 +-
> tools/perf/util/bpf-event.c | 17 +-
> tools/perf/util/bpf-event.h | 12 +-
> tools/perf/util/bpf_lock_contention.c | 10 +-
> tools/perf/util/build-id.c | 136 +-
> tools/perf/util/build-id.h | 2 -
> tools/perf/util/callchain.c | 4 +-
> tools/perf/util/comm.c | 10 +-
> tools/perf/util/compress.h | 6 +-
> tools/perf/util/data-convert-json.c | 2 +-
> tools/perf/util/db-export.c | 6 +-
> tools/perf/util/debug.c | 22 +-
> tools/perf/util/debug.h | 1 +
> tools/perf/util/dlfilter.c | 12 +-
> tools/perf/util/dso.c | 468 ++++---
> tools/perf/util/dso.h | 544 ++++++--
> tools/perf/util/dsos.c | 529 ++++---
> tools/perf/util/dsos.h | 40 +-
> tools/perf/util/env.c | 53 +-
> tools/perf/util/env.h | 4 +
> tools/perf/util/event.c | 12 +-
> tools/perf/util/header.c | 47 +-
> tools/perf/util/hist.c | 4 +-
> tools/perf/util/intel-pt.c | 22 +-
> tools/perf/util/machine.c | 652 +++------
> tools/perf/util/machine.h | 32 +-
> tools/perf/util/map.c | 93 +-
> tools/perf/util/map.h | 83 +-
> tools/perf/util/maps.c | 1239 +++++++++++++----
> tools/perf/util/maps.h | 95 +-
> tools/perf/util/mmap.c | 5 +-
> tools/perf/util/mmap.h | 1 -
> tools/perf/util/pmu.c | 48 +-
> tools/perf/util/pmus.c | 30 +-
> tools/perf/util/probe-event.c | 62 +-
> tools/perf/util/rb_resort.h | 5 -
> .../scripting-engines/trace-event-python.c | 21 +-
> tools/perf/util/session.c | 21 +
> tools/perf/util/session.h | 2 +
> tools/perf/util/sort.c | 19 +-
> tools/perf/util/srcline.c | 65 +-
> tools/perf/util/symbol-elf.c | 138 +-
> tools/perf/util/symbol.c | 521 ++-----
> tools/perf/util/symbol.h | 1 -
> tools/perf/util/symbol_fprintf.c | 4 +-
> tools/perf/util/synthetic-events.c | 156 ++-
> tools/perf/util/thread.c | 48 +-
> tools/perf/util/thread.h | 6 -
> tools/perf/util/threads.c | 186 +++
> tools/perf/util/threads.h | 35 +
> tools/perf/util/unwind-libunwind-local.c | 50 +-
> tools/perf/util/unwind-libunwind.c | 9 +-
> tools/perf/util/vdso.c | 89 +-
> tools/perf/util/zstd.c | 63 +-
> 88 files changed, 4101 insertions(+), 2827 deletions(-)
> create mode 100644 tools/lib/api/io_dir.h
> create mode 100644 tools/perf/util/threads.c
> create mode 100644 tools/perf/util/threads.h
>
> --
> 2.43.0.rc1.413.gea7ed67945-goog
>