Re: [PATCH v2 1/2] mm/slub: enable debugging memory wasting of kmalloc

From: Feng Tang
Date: Mon Jul 25 2022 - 09:24:07 EST


Hi Kefeng,

Thanks for the review.

On 2022/7/25 20:19, Kefeng Wang wrote:

On 2022/7/25 19:20, Feng Tang wrote:
kmalloc's API family is critical for mm, with one shortcoming that
its object size is fixed to be a power of 2. When a user requests
memory for '2^n + 1' bytes, 2^(n+1) bytes will actually be allocated,
so in the worst case around 50% of the memory is wasted.

We've met a kernel boot OOM panic (v5.10), and from the dumped slab info:

[ 26.062145] kmalloc-2k 814056KB 814056KB

From debugging we found a huge number of 'struct iova_magazine'
objects, whose size is 1032 bytes (1024 + 8), so each allocation
wastes 1016 bytes. Though the issue was solved by provisioning the
right (bigger) amount of RAM, it would still be nice to optimize the
size (either use a kmalloc-friendly size or create a dedicated slab
for it).

And from the lkml archive, there was another crash-kernel OOM case [1]
back in 2019 which appears to be related to a similar slab-waste
situation, as the log is similar:

[ 4.332648] iommu: Adding device 0000:20:02.0 to group 16
[ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
...
[ 4.857565] kmalloc-2048 59164KB 59164KB

The crash kernel only has 256MB of memory, so 59MB is pretty big here.
(Note: the related code has been changed and optimized in recent
kernels [2]; these logs are shown just to demonstrate the problem.)

So add a way to track each kmalloc's memory waste, and leverage
the existing SLUB debug framework to show its call stack, so
that users can evaluate the waste situation, identify hot spots
and optimize accordingly, for better utilization of memory.

The waste info is integrated into the existing interface
/sys/kernel/debug/slab/kmalloc-xx/alloc_traces; one example from
'kmalloc-4k' after boot is:

126 ixgbe_alloc_q_vector+0xa5/0x4a0 [ixgbe] waste=233856/1856 age=1493302/1493830/1494358 pid=1284 cpus=32 nodes=1
__slab_alloc.isra.86+0x52/0x80
__kmalloc_node+0x143/0x350
ixgbe_alloc_q_vector+0xa5/0x4a0 [ixgbe]
ixgbe_init_interrupt_scheme+0x1a6/0x730 [ixgbe]
ixgbe_probe+0xc8e/0x10d0 [ixgbe]
local_pci_probe+0x42/0x80
work_for_cpu_fn+0x13/0x20
process_one_work+0x1c5/0x390

which means that in the 'kmalloc-4k' slab there are 126 requests of
2240 bytes which each got a 4KB slot (wasting 1856 bytes each
and 233856 bytes in total). And when the system starts some real
workload like multiple docker instances, the waste becomes more
severe.

[1]. https://lkml.org/lkml/2019/8/12/266
[2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@xxxxxxxxxx/

[Thanks Hyeonggon for pointing out several bugs about sorting/format]
[Thanks Vlastimil for suggesting way to reduce memory usage of
orig_size and keep it only for kmalloc objects]

Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
---
since v1:
* limit the 'orig_size' to kmalloc objects only, and save
it after track in metadata (Vlastimil Babka)
* fix an offset calculation problem in print_trailer

since RFC:
* fix problems in kmem_cache_alloc_bulk() and records sorting,
improve the print format (Hyeonggon Yoo)
* fix a compiling issue found by 0Day bot
* update the commit log based info from iova developers



include/linux/slab.h | 2 +
mm/slub.c | 96 ++++++++++++++++++++++++++++++++++++--------
2 files changed, 82 insertions(+), 16 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0fefdf528e0d..a713b0e5bbcd 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -29,6 +29,8 @@
#define SLAB_RED_ZONE ((slab_flags_t __force)0x00000400U)
/* DEBUG: Poison objects */
#define SLAB_POISON ((slab_flags_t __force)0x00000800U)
+/* Indicate a kmalloc slab */
+#define SLAB_KMALLOC ((slab_flags_t __force)0x00001000U)
/* Align objs on cache lines */
#define SLAB_HWCACHE_ALIGN ((slab_flags_t __force)0x00002000U)
/* Use GFP_DMA memory */
diff --git a/mm/slub.c b/mm/slub.c
index b1281b8654bd..9763a38bc4f0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -191,6 +191,12 @@ static inline bool kmem_cache_debug(struct kmem_cache *s)
return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS);
}
+static inline bool slub_debug_orig_size(struct kmem_cache *s)
+{
+ return (s->flags & SLAB_KMALLOC &&
+ kmem_cache_debug_flags(s, SLAB_STORE_USER));
Swap the two checks, so the SLAB_KMALLOC test is skipped when SLAB_STORE_USER is not set.


Ok, will change.

+}
+
void *fixup_red_left(struct kmem_cache *s, void *p)
{
if (kmem_cache_debug_flags(s, SLAB_RED_ZONE))
@@ -814,6 +820,36 @@ static void print_slab_info(const struct slab *slab)
pr_err("Slab 0x%p objects=%u used=%u fp=0x%p flags=%pGp\n",
slab, slab->objects, slab->inuse, slab->freelist,
folio_flags(folio, 0));
+
+}
+static inline void set_orig_size(struct kmem_cache *s,
+ void *object, unsigned int orig_size)
+{
+ void *p = kasan_reset_tag(object);
+
+ if (!slub_debug_orig_size(s))
+ return;
+
+ p = object + get_info_end(s);
Looks like this should be p += get_info_end(s); ?
+
+ if (s->flags & SLAB_STORE_USER)
+ p += sizeof(struct track) * 2;
+
+ *(unsigned int *)p = orig_size;
+}
+
+static unsigned int get_orig_size(struct kmem_cache *s, void *object)
+{
+ void *p = kasan_reset_tag(object);
+
+ if (!slub_debug_orig_size(s))
+ return s->object_size;
+
+ p = object + get_info_end(s);
ditto...

Good catch! Will change both of them, thanks!

Thanks,
Feng

+ if (s->flags & SLAB_STORE_USER)
+ p += sizeof(struct track) * 2;
+
+ return *(unsigned int *)p;
}

[...]