Re: [RFC 2/2] kread: avoid duplicates

From: Edgecombe, Rick P
Date: Mon Apr 17 2023 - 13:34:03 EST


On Sat, 2023-04-15 at 23:41 -0700, Luis Chamberlain wrote:
> On Sat, Apr 15, 2023 at 11:04:12PM -0700, Christoph Hellwig wrote:
> > On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> > > With this we run into 0 wasted virtual memory bytes.
> >
> > Avoid what duplicates?
>
> David Hildenbrand had reported that with over 400 CPUs vmap space
> runs out and it seems it was related to module loading. I took a
> look and confirmed it. In the worst case module loading requires 3
> vmalloc allocations, so memory use is typically at least twice the
> module size, plus the decompressed module size in the worst case:
>
> a) initial kernel_read*() call
> b) optional module decompression
> c) the actual module data copy we will keep
>
> Duplicate module requests that come from userspace end up being
> thrown in the trash bin, as only one module will be allocated.
> Although there are checks for a module prior to requesting a module,
> udev still doesn't do the best of a job to avoid that and so we end
> up with tons of duplicate module requests. We're talking about
> gigabytes of vmalloc bytes just lost because of this for large
> systems and megabytes for average systems. So for example with just
> 255 CPUs we can lose about 13.58 GiB, and for 8 CPUs about
> 226.53 MiB.
>
> I have patches to curtail 1/2 of that space by doing a check in
> kernel before we do the allocation in c) if the module is already
> present. For a) it is harder because userspace just passes a file
> descriptor. But since we can get the file path without the vmalloc
> this RFC suggests maybe we can add a new kernel_read*() for module
> loading where it makes sense to have only one read happen at a time.

I'm wondering how difficult it would be to just try to remove the
vmallocs in (a) and (b) and operate on a list of pages.

So the operations before module_patient_check_exists() are now:
1. decompressing (vmalloc)
2. signature check (vmalloc)
3. elf_validity_cache_copy()
4. early_mod_check() -> module_patient_check_exists()

Right? Then after that a bunch of arch code and other code outside of
modules operates on the vmalloc'd buffer, so that other code would need
a large number of changes to switch to a list of pages.

But did you consider teaching just 1-3 to operate on a list of pages,
and then moving module_patient_check_exists() a little earlier in the
order? After module_patient_check_exists() you could vmap() the pages
and hand the mapping off to the existing code located all over the place.

Then you can catch the duplicate requests before any vmalloc happens.
It also takes (a) and (b) down to one vmalloc even in the normal case,
as a side benefit. The changes to the signature check part might be
tricky though.
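
To make the ordering concrete, here is a rough userspace sketch of what
I mean. It is not kernel code and every name in it is made up for
illustration (struct page_list, page_list_fill(), page_list_map(),
load_module_sketch()). The malloc()+memcpy() in page_list_map() only
stands in for vmap(), which would map the existing pages rather than
copy them, and the name lookup only stands in for
module_patient_check_exists().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u
#define MAX_LOADED 16

/* Hypothetical stand-in for a list of pages holding the module file. */
struct page_list {
	unsigned char **pages;
	size_t nr_pages;
	size_t len;
};

static const char *loaded[MAX_LOADED];	/* names of "loaded" modules */
static size_t nr_loaded;
static size_t contiguous_allocs;	/* counts stand-in "vmap" calls */

static void page_list_free(struct page_list *pl)
{
	for (size_t i = 0; i < pl->nr_pages; i++)
		free(pl->pages[i]);
	free(pl->pages);
	pl->pages = NULL;
	pl->nr_pages = 0;
	pl->len = 0;
}

/* Copy the data into separately allocated pages; nothing contiguous. */
static int page_list_fill(struct page_list *pl, const void *data, size_t len)
{
	size_t n = (len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;

	pl->nr_pages = 0;
	pl->len = len;
	pl->pages = calloc(n ? n : 1, sizeof(*pl->pages));
	if (!pl->pages)
		return -1;
	for (size_t i = 0; i < n; i++) {
		size_t off = i * SKETCH_PAGE_SIZE;
		size_t chunk = len - off < SKETCH_PAGE_SIZE ?
			       len - off : SKETCH_PAGE_SIZE;

		pl->pages[i] = malloc(SKETCH_PAGE_SIZE);
		if (!pl->pages[i]) {
			page_list_free(pl);
			return -1;
		}
		memcpy(pl->pages[i], (const unsigned char *)data + off, chunk);
		pl->nr_pages++;
	}
	return 0;
}

/* Stand-in for vmap(): produce one contiguous view of the pages. */
static void *page_list_map(const struct page_list *pl)
{
	unsigned char *buf = malloc(pl->len ? pl->len : 1);

	assert(pl->nr_pages * SKETCH_PAGE_SIZE >= pl->len);
	if (!buf)
		return NULL;
	for (size_t i = 0; i < pl->nr_pages; i++) {
		size_t off = i * SKETCH_PAGE_SIZE;
		size_t chunk = pl->len - off < SKETCH_PAGE_SIZE ?
			       pl->len - off : SKETCH_PAGE_SIZE;

		memcpy(buf + off, pl->pages[i], chunk);
	}
	contiguous_allocs++;
	return buf;
}

/* Stand-in for module_patient_check_exists(): plain lookup by name. */
static bool already_loaded(const char *name)
{
	for (size_t i = 0; i < nr_loaded; i++)
		if (strcmp(loaded[i], name) == 0)
			return true;
	return false;
}

/*
 * Returns 0 if loaded, 1 if a duplicate was dropped, -1 on error.
 * The caller must keep 'name' alive; the sketch stores the pointer.
 */
static int load_module_sketch(const char *name, const void *data, size_t len)
{
	struct page_list pl;
	void *mapped;

	if (page_list_fill(&pl, data, len))
		return -1;
	/* Steps 1-3 would operate on the page list here. */
	if (already_loaded(name) || nr_loaded == MAX_LOADED) {
		int ret = already_loaded(name) ? 1 : -1;

		page_list_free(&pl);	/* dropped before any mapping */
		return ret;
	}
	mapped = page_list_map(&pl);	/* the only contiguous allocation */
	page_list_free(&pl);
	if (!mapped)
		return -1;
	loaded[nr_loaded++] = name;
	free(mapped);	/* real code would hand this to the existing loader */
	return 0;
}
```

In this sketch, calling load_module_sketch() twice with the same name
performs only one contiguous allocation: the second request is dropped
while the data is still sitting in individual pages.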

Sorry if this idea is off; I've gotten a little confused as this series
has split into all these offshoot series.