Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text

From: Song Liu
Date: Wed Oct 12 2022 - 01:38:03 EST




> On Oct 11, 2022, at 1:40 PM, Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx> wrote:
>
> On Tue, 2022-10-11 at 16:25 +0000, Song Liu wrote:
>>> Maybe this is just me missing some vmalloc understanding, but this
>>> pointer to an all zero vm_struct seems weird too. Are there other
>>> vmap
>>> allocations like this? Which vmap APIs work with this and which
>>> don't?
>>
>> There are two vmap trees at the moment: free_area_ tree and
>> vmap_area_ tree. free_area_ tree uses vmap->subtree_max_size, while
>> vmap_area_ tree contains vmap backed by vm_struct, and thus uses
>> vmap->vm.
>>
>> This set adds a new tree, free_text_area_. This tree is different from
>> the other two, as it uses subtree_max_size, and it is also backed
>> by vm_struct. To handle this requirement without growing vmap_struct,
>> we introduced all_text_vm to store the vm_struct for free_text_area_
>> tree.
>>
>> free_text_area_ tree is different to vmap_area_ tree. Each vmap in
>> vmap_area_ tree has its own vm_struct (1 to 1 mapping), while
>> multiple vmap in free_text_area_ tree map to a single vm_struct.
>>
>> Also, free_text_area_ handles granularity < PAGE_SIZE; while the
>> other two trees only work with PAGE_SIZE aligned memory.
>>
>> Does this answer your questions?
>
> I mean from the perspective of someone trying to use this without
> diving into the entire implementation.
>
> The function is called vmalloc_exec() and is freed with vfree_exec().
> Makes sense. But with the other vmallocs_foo's (including previous
> vmalloc_exec() implementations) you can call find_vm_area(), etc on
> them. They show in "vmallocinfo" and generally behave similarly. That
> isn't true for these new allocations, right?

That's right. These operations are not supported (at least for now).
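
To illustrate why, here is a toy user-space model. It is not the patch
code: every toy_* name is made up for this sketch, and it uses flat
arrays instead of trees. It assumes that a find_vm_area()-style lookup
only searches the busy areas, where each area has its own vm_struct,
while the text areas are sub-page ranges that all share one vm_struct.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model, NOT kernel code. All toy_* names are hypothetical. */
struct toy_vm {
    const char *name;
};

struct toy_area {
    unsigned long start;    /* inclusive */
    unsigned long end;      /* exclusive */
    struct toy_vm *vm;
};

#define TOY_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Busy areas: each area owns its vm (1 to 1), like the vmap_area_ tree. */
static struct toy_vm toy_vm_a = { "vmalloc A" };
static struct toy_vm toy_vm_b = { "vmalloc B" };
static struct toy_area toy_busy_areas[] = {
    { 0x1000, 0x3000, &toy_vm_a },
    { 0x5000, 0x6000, &toy_vm_b },
};

/* Text areas: sub-page ranges that all point at one shared vm. */
static struct toy_vm toy_all_text_vm = { "shared text" };
static struct toy_area toy_text_areas[] = {
    { 0x8000, 0x8040, &toy_all_text_vm },
    { 0x8040, 0x8100, &toy_all_text_vm },
};

/* Linear search standing in for a tree walk. */
static struct toy_vm *toy_lookup(const struct toy_area *areas, size_t n,
                                 unsigned long addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= areas[i].start && addr < areas[i].end)
            return areas[i].vm;
    }
    return NULL;
}

/* Models a lookup that only knows about the busy areas. */
static struct toy_vm *toy_find_vm_area(unsigned long addr)
{
    return toy_lookup(toy_busy_areas, TOY_ARRAY_SIZE(toy_busy_areas), addr);
}

/* Models a lookup in the separate text-area set. */
static struct toy_vm *toy_find_text_vm(unsigned long addr)
{
    return toy_lookup(toy_text_areas, TOY_ARRAY_SIZE(toy_text_areas), addr);
}

/* Self-check of the properties described above. */
static void toy_self_check(void)
{
    /* A regular allocation is found through the busy lookup. */
    assert(toy_find_vm_area(0x1000) == &toy_vm_a);
    /* A text allocation is invisible to the busy lookup. */
    assert(toy_find_vm_area(0x8040) == NULL);
    /* Two different text areas resolve to the same shared vm. */
    assert(toy_find_text_vm(0x8000) == toy_find_text_vm(0x8040));
}
```

In this model, toy_find_vm_area() returns NULL for any address in the
text ranges, which is the same kind of behavior you described for
find_vm_area() and "vmallocinfo" with the new allocations.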

>
> Then you have code that operates on module text like:
> 	if (is_vmalloc_or_module_addr(addr))
> 		pfn = vmalloc_to_pfn(addr);
>
> It looks like it would work (on x86 at least). Should it be expected
> to?
>
> Especially after this patch, where there is memory that isn't even
> tracked by the original vmap_area trees, it is pretty much a separate
> allocator. So I think it might be nice to spell out which other vmalloc
> APIs work with these new functions since they are named "vmalloc".
> Maybe just say none of them do.

I guess it is fair to call this a separate allocator. Maybe
vmalloc_exec is not the right name? I do think this is the best
way to build an allocator with vmap tree logic.

>
>
> Separate from that, I guess you are planning to make this limited to
> certain architectures? It might be better to put logic with assumptions
> about x86 boot time page table details inside arch/x86 somewhere.

Yes, the architecture needs some text_poke mechanism to use this.
On the BPF side, x86_64 calls this directly from arch code (the JIT
engine), so it is mostly covered. For modules, we need to handle this
better.

Thanks,
Song