Re: [linus:master] [bpf] c930472552: WARNING:at_kernel/bpf/memalloc.c:#bpf_mem_alloc_init

From: Hou Tao
Date: Tue Nov 07 2023 - 05:43:36 EST


Hi,

On 11/4/2023 12:49 AM, Yonghong Song wrote:
>
> On 11/2/23 11:54 PM, Hou Tao wrote:
>> Hi,
>>
>> On 11/3/2023 12:08 AM, Yonghong Song wrote:
>>> On 11/2/23 6:40 AM, Hou Tao wrote:
>>>> Hi Alexei,
>>>>
>>>> On 10/31/2023 4:01 PM, Hou Tao wrote:
>>>>> Hi,
>>>>>
>>>>> On 10/30/2023 10:11 PM, kernel test robot wrote:
>>>>>> hi, Hou Tao,
>>>>>>
>>>>>> we noticed a WARN_ONCE added in this commit was hit in our tests.
>>>>>> FYI.
>>>>>>
>>>>>>
SNIP
>>>> I see what has happened. The problem is twofold:
>>>> (1) The object_size of kmalloc-cg-96 is adjust from 96 to 128 due to
>>>> slab merge in __kmem_cache_alias(). For SLAB, SLAB_HWCACHE_ALIGN is
>>>> enabled by default for kmalloc slab, so align is 64 and size is 128
>>>> for
>>>> kmalloc-cg-96. So when unit_alloc() does kmalloc_node(96,
>>>> __GFP_ACCOUNT,
>>>> node), ksize() will return 128 instead of 96 for the returned pointer.
>>>> SLUB has a similar merge logic, but because its align is 8 under
>>>> x86-64,
>>>> so the warning doesn't happen for i386 + SLUB, but I think the similar
>>>> problem may exist for other architectures.
>>>> (2) kmalloc_size_roundup() returns the object_size of kmalloc-96
>>>> instead
>>>> of kmalloc-cg-96, so bpf_mem_cache_adjust_size() doesn't adjust
>>>> size_index accordingly. The reason why the object_size of
>>>> kmalloc-96 is
>>>> 96 instead of 128 is that there is slab merge for kmalloc-96.
>>>>
>>>> About how to fix the problem, I have two ideas:
>>>> The first is to introduce kmalloc_size_roundup_flags(), so
>>>> bpf_mem_cache_adjust_size() could use kmalloc_size_roundup_flags(size,
>>>> __GFP_ACCOUNT) to get the object_size of kmalloc-cg-xxx. It could fix
>>>> the warning for now, but the warning may pop-up occasionally due to
>>>> SLUB
>>>> merge and unusual slab align. The second is just using the
>>>> bpf_mem_cache
>>>> pointer to get the unit_size which is saved before the to-be-free
>>>> pointer. Its downside is that it may can not be able to skip the free
>>>> operation for pointer which is not allocated from bpf ma, but I
>>>> think it
>>>> is acceptable. I prefer the latter solution. What do you think ?
>>>
>>> Is it possible that in bpf_mem_cache_adjust_size(), we do a series of
>>> kmalloc (for supported bucket size) and call ksize() to get the actual
>>> allocated object size. So eventually all possible allocated object
>>> sizes
>>> will be used for size_index[]. This will avoid all kind of special
>>> corner cases due to config/macro/arch etc. WDYT?
>> It is basically the same as the first proposed solution and it has the
>> same flaw. The problem is that slab merge can happen in any time, so the
>> return value of ksize() may change even all passed pointers are
>> allocated from the same slab. Considering the following case: during the
>> invocation of bpf_mem_cache_adjust_size() or the initialization of
>> bpf_global_ma, there is no slab merge and ksize() for a 96-bytes object
>> returns 96. But after these invocations, a new slab created by a kernel
>> module is merged to kmalloc-cg-96 and the object_size of kmalloc-cg-96
>> is adjust from 96 to 128 (which is possible for x86-64 + CONFIG_SLAB,
>> because it is alignment requirement is 64 for 96-bytes slab). So soon or
>
> So, the object_size for allocated objects in that is adjusted from 96
> to 128
> while previously allocated objects should have no change, it is merely
> ksize(old_obj)
> previous return 96, now returns 128, right? Okay, so this is indeed a
> problem
> since we use ksize() to decide the bucket.

Yes. The object_size of underlying slab changes, so the return value of
ksize() will change as well.
>
>
>> later, when bpf_global_ma frees a 96-byte-sized pointer which is
>> allocated from a bpf_mem_cache in which unit_size is 96, bpf_mem_free()
>> will free the pointer through a bpf_mem_cache in which unit_size is 128,
>> because the return value of ksize() changes. Maybe we should introduce a
>> new API in mm which returns size instead of object_size of underlying
>> slab, so the return value will not change due to slab merge.
>
> In this case, to avoid the warning, indeed we need to use '96' instead
> of '128'.
> So use the original ksize() return value is indeed a solution.
> We could use the mechanism similar to percpu alloc to save '96' in the
> memory.

We have already saved the pointer of bpf_mem_cache in the extra space
(aka LLIST_NODE_SZ) which is allocated together with the returned
pointer, so I think we could use bpf_mem_cache->unit_size to get the
size of the free pointer directly. I will check whether or not there is
performance degradation before posting the patch.

Regards,
Tao