Re: [linus:master] [bpf] c930472552: WARNING:at_kernel/bpf/memalloc.c:#bpf_mem_alloc_init

From: Yonghong Song
Date: Tue Nov 07 2023 - 11:13:38 EST



On 11/7/23 2:43 AM, Hou Tao wrote:
Hi,

On 11/4/2023 12:49 AM, Yonghong Song wrote:
On 11/2/23 11:54 PM, Hou Tao wrote:
Hi,

On 11/3/2023 12:08 AM, Yonghong Song wrote:
On 11/2/23 6:40 AM, Hou Tao wrote:
Hi Alexei,

On 10/31/2023 4:01 PM, Hou Tao wrote:
Hi,

On 10/30/2023 10:11 PM, kernel test robot wrote:
hi, Hou Tao,

we noticed a WARN_ONCE added in this commit was hit in our tests.
FYI.


SNIP
I see what has happened. The problem is twofold:
(1) The object_size of kmalloc-cg-96 is adjust from 96 to 128 due to
slab merge in __kmem_cache_alias(). For SLAB, SLAB_HWCACHE_ALIGN is
enabled by default for kmalloc slab, so align is 64 and size is 128
for
kmalloc-cg-96. So when unit_alloc() does kmalloc_node(96,
__GFP_ACCOUNT,
node), ksize() will return 128 instead of 96 for the returned pointer.
SLUB has a similar merge logic, but because its align is 8 under
x86-64,
so the warning doesn't happen for i386 + SLUB, but I think the similar
problem may exist for other architectures.
(2) kmalloc_size_roundup() returns the object_size of kmalloc-96
instead
of kmalloc-cg-96, so bpf_mem_cache_adjust_size() doesn't adjust
size_index accordingly. The reason why the object_size of
kmalloc-96 is
96 instead of 128 is that there is slab merge for kmalloc-96.

About how to fix the problem, I have two ideas:
The first is to introduce kmalloc_size_roundup_flags(), so
bpf_mem_cache_adjust_size() could use kmalloc_size_roundup_flags(size,
__GFP_ACCOUNT) to get the object_size of kmalloc-cg-xxx. It could fix
the warning for now, but the warning may pop-up occasionally due to
SLUB
merge and unusual slab align. The second is just using the
bpf_mem_cache
pointer to get the unit_size which is saved before the to-be-free
pointer. Its downside is that it may can not be able to skip the free
operation for pointer which is not allocated from bpf ma, but I
think it
is acceptable. I prefer the latter solution. What do you think ?
Is it possible that in bpf_mem_cache_adjust_size(), we do a series of
kmalloc (for supported bucket size) and call ksize() to get the actual
allocated object size. So eventually all possible allocated object
sizes
will be used for size_index[]. This will avoid all kind of special
corner cases due to config/macro/arch etc. WDYT?
It is basically the same as the first proposed solution and it has the
same flaw. The problem is that slab merge can happen in any time, so the
return value of ksize() may change even all passed pointers are
allocated from the same slab. Considering the following case: during the
invocation of bpf_mem_cache_adjust_size() or the initialization of
bpf_global_ma, there is no slab merge and ksize() for a 96-bytes object
returns 96. But after these invocations, a new slab created by a kernel
module is merged to kmalloc-cg-96 and the object_size of kmalloc-cg-96
is adjust from 96 to 128 (which is possible for x86-64 + CONFIG_SLAB,
because it is alignment requirement is 64 for 96-bytes slab). So soon or
So, the object_size for allocated objects in that is adjusted from 96
to 128
while previously allocated objects should have no change, it is merely
ksize(old_obj)
previous return 96, now returns 128, right? Okay, so this is indeed a
problem
since we use ksize() to decide the bucket.
Yes. The object_size of underlying slab changes, so the return value of
ksize() will change as well.

later, when bpf_global_ma frees a 96-byte-sized pointer which is
allocated from a bpf_mem_cache in which unit_size is 96, bpf_mem_free()
will free the pointer through a bpf_mem_cache in which unit_size is 128,
because the return value of ksize() changes. Maybe we should introduce a
new API in mm which returns size instead of object_size of underlying
slab, so the return value will not change due to slab merge.
In this case, to avoid the warning, indeed we need to use '96' instead
of '128'.
So use the original ksize() return value is indeed a solution.
We could use the mechanism similar to percpu alloc to save '96' in the
memory.
We have already saved the pointer of bpf_mem_cache in the extra space
(aka LLIST_NODE_SZ) which is allocated together with the returned
pointer, so I think we could use bpf_mem_cache->unit_size to get the
size of the free pointer directly. I will check whether or not there is
performance degradation before posting the patch.

See:

bpf_local_storage.c:            err = bpf_mem_alloc_init(&smap->selem_ma, smap->elem_size, false);
bpf_local_storage.c:            err = bpf_mem_alloc_init(&smap->storage_ma, sizeof(struct bpf_local_storage), false);
core.c: ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false);
core.c: ret = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
cpumask.c:      ret = bpf_mem_alloc_init(&bpf_cpumask_ma, sizeof(struct bpf_cpumask), false);
hashtab.c:              err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
hashtab.c:                      err = bpf_mem_alloc_init(&htab->pcpu_ma,
memalloc.c:int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)

Some 'size' parameter in core.c is zero.
Not sure how exactly you will resolve this issue based on
bpf_mem_cache->unit_size. But looking forward to your patch!


Regards,
Tao