[PATCH v5 0/4] mm/page_owner: Extend page_owner to show memcg information

From: Waiman Long
Date: Mon Feb 07 2022 - 20:06:49 EST


v5:
- Apply the following changes to patch 3
1) Make cgroup_name() write directly into kbuf without using an
intermediate buffer.
2) Change the terminology from "offline memcg" to "dying memcg" to align
better with similar terms used elsewhere in the kernel.

v4:
- Take rcu_read_lock() when memcg is being accessed as suggested by
Michal.
- Make print_page_owner_memcg() return the new offset into the buffer
and put CONFIG_MEMCG block inside as suggested by Mike.
- Directly use TASK_COMM_LEN as length of name buffer as suggested by
Roman.

v3:
- Add unlikely() to patch 1 and clarify that -1 will not be returned.
- Use a helper function to print out memcg information in patch 3.
- Add a new patch 4 to store task command name in page_owner
structure.

While debugging the constant increase in percpu memory consumption on
a system that spawned a large number of containers, it was found that a
lot of dying mem_cgroup structures remained in place without being
freed. Further investigation indicated that those mem_cgroup structures
were pinned by some pages.

In order to find out what those pages are, the existing page_owner
debugging tool is extended to show memory cgroup information and whether
those memcgs are dying or not. With the enhanced page_owner tool,
the following is a typical page that pinned the mem_cgroup structure
in my test case:

Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 70984 (podman), ts 5421278969115 ns, free_ts 5420935666638 ns
PFN 3205061 type Movable Block 6259 type Movable Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
prep_new_page+0x8e/0xb0
get_page_from_freelist+0xc4d/0xe50
__alloc_pages+0x172/0x320
alloc_pages_vma+0x84/0x230
shmem_alloc_page+0x3f/0x90
shmem_alloc_and_acct_page+0x76/0x1c0
shmem_getpage_gfp+0x48d/0x890
shmem_write_begin+0x36/0xc0
generic_perform_write+0xed/0x1d0
__generic_file_write_iter+0xdc/0x1b0
generic_file_write_iter+0x5d/0xb0
new_sync_write+0x11f/0x1b0
vfs_write+0x1ba/0x2a0
ksys_write+0x59/0xd0
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Charged to dying memcg libpod-conmon-fbc62060b5377479a7371cc16c5c596002945f2aa00d3d6d73a0cd0d148b6637.scope

So the page was not freed because it was part of a shmem segment. That
is useful information that can help users to diagnose similar problems.
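
As an illustration of how such a dump might be post-processed, here is a
minimal Python sketch that groups records charged to a dying memcg by their
allocation stack. It is not part of this series. It assumes that records are
separated by blank lines, that the memcg line looks like the "Charged to
dying memcg <name>" line in the sample above, and that stack frames have the
symbol+0xoff/0xsize form shown there. The function name
count_dying_stacks() is hypothetical.

```python
import re
from collections import Counter

# Assumed format, taken from the sample record in this letter:
#   "Charged to dying memcg <name>"
DYING_RE = re.compile(r"^Charged\b.*\bto dying memcg (\S+)")
# Stack frame lines such as "shmem_alloc_page+0x3f/0x90".
FRAME_RE = re.compile(r"^[A-Za-z_][\w.]*\+0x[0-9a-f]+/0x[0-9a-f]+$")


def count_dying_stacks(dump_text):
    """Count page records charged to a dying memcg, keyed by call stack.

    The key is a tuple of symbol names (offsets stripped) so that pages
    allocated through the same path are grouped together.
    """
    counts = Counter()
    # Assumption: page_owner records are separated by blank lines.
    for record in re.split(r"\n\s*\n", dump_text):
        lines = [ln.strip() for ln in record.splitlines() if ln.strip()]
        if not any(DYING_RE.match(ln) for ln in lines):
            continue  # not charged to a dying memcg
        frames = tuple(ln.split("+")[0] for ln in lines if FRAME_RE.match(ln))
        counts[frames] += 1
    return counts
```

Feeding it the text read from the page_owner debugfs file (typically
/sys/kernel/debug/page_owner on a kernel booted with page_owner=on) and
printing counts.most_common() should show which allocation paths, such as
the shmem path above, account for most of the pinning pages.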

With cgroup v1, /proc/cgroups can be read to find out the total number
of memory cgroups (online + dying). With cgroup v2, the cgroup.stat
of the root cgroup can be read to find the number of dying cgroups
(most likely pinned by dying memcgs).
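
For monitoring those counters over time, a small Python sketch along the
following lines may be enough. It only parses the text of the two files;
the helper names are hypothetical, and the cgroup v2 mount point
/sys/fs/cgroup mentioned afterwards is an assumption about the test system.

```python
def parse_cgroup_stat(text):
    """Parse cgroup v2 cgroup.stat content into {key: int}.

    Lines are expected to look like "nr_dying_descendants 42".
    Anything that is not a "<key> <number>" pair is skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def parse_proc_cgroups(text):
    """Parse /proc/cgroups (cgroup v1) into {subsys_name: num_cgroups}.

    Columns are: subsys_name, hierarchy, num_cgroups, enabled.
    The header line starts with '#' and is skipped.
    """
    counts = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 3 and fields[2].isdigit():
            counts[fields[0]] = int(fields[2])
    return counts
```

On cgroup v2, parse_cgroup_stat(open("/sys/fs/cgroup/cgroup.stat").read())
would give the nr_dying_descendants value. On cgroup v1,
parse_proc_cgroups(open("/proc/cgroups").read())["memory"] would give the
total (online + dying) memcg count. Sampling either value periodically shows
whether the number keeps growing.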

The page_owner feature is not supposed to be enabled on production
systems due to its memory overhead. However, if it is suspected that
dying memcgs are increasing over time, a test environment with page_owner
enabled can be set up with an appropriate workload for further analysis
of what may be causing the increasing number of dying memcgs.

Waiman Long (4):
lib/vsprintf: Avoid redundant work with 0 size
mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
mm/page_owner: Print memcg information
mm/page_owner: Record task command name

lib/vsprintf.c | 8 +++---
mm/page_owner.c | 72 ++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 62 insertions(+), 18 deletions(-)

--
2.27.0