Re: [PATCH v2 2/2] module: Merge same-name module load requests

From: Prarit Bhargava
Date: Sun Dec 04 2022 - 15:00:25 EST


On 11/29/22 08:13, Petr Pavlu wrote:
On 11/28/22 17:29, Prarit Bhargava wrote:
On 11/14/22 10:45, David Hildenbrand wrote:
On 14.11.22 16:38, Luis Chamberlain wrote:
On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote:
On 12.11.22 02:47, Luis Chamberlain wrote:
On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote:
On 10/18/22 20:33, Luis Chamberlain wrote:
On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote:
The patch does address a regression observed after commit
6e6de3dee51a
("kernel/module.c: Only return -EEXIST for modules that have
finished
loading"). I guess it can have a Fixes tag added to the patch.

I think it is hard to split this patch into parts because the
implemented
"optimization" is the fix.

git describe --contains 6e6de3dee51a
v5.3-rc1~38^2~6

I'm a bit torn about this situation. Reverting 6e6de3dee51a would
be the
right thing to do, but without it, it still leaves the issue reported
by Prarit Bhargava. We need a way to resolve the issue on stable and
then your optimizations can be applied on top.

Simpler could be to do the following:

diff --git a/kernel/module/main.c b/kernel/module/main.c
index d02d39c7174e..0302ac387e93 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
       sched_annotate_sleep();
       mutex_lock(&module_mutex);
       mod = find_module_all(name, strlen(name), true);
-    ret = !mod || mod->state == MODULE_STATE_LIVE;
+    ret = !mod || mod->state == MODULE_STATE_LIVE
+        || mod->state == MODULE_STATE_GOING;
       mutex_unlock(&module_mutex);
       return ret;
@@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module
*mod)
       mutex_lock(&module_mutex);
       old = find_module_all(mod->name, strlen(mod->name), true);
       if (old != NULL) {
-        if (old->state != MODULE_STATE_LIVE) {
+        if (old->state == MODULE_STATE_COMING
+            || old->state == MODULE_STATE_UNFORMED) {
               /* Wait in case it fails to load. */
               mutex_unlock(&module_mutex);
               err = wait_event_interruptible(module_wq,
@@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module
*mod)
                   goto out_unlocked;
               goto again;
           }
-        err = -EEXIST;
+        err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST;
           goto out;
       }
       mod_update_bounds(mod);


Prarit, can you verify this still does not break the issue you
reported?
David, does this also fix your issue?

I didn't try, but from a quick glimpse I assume no. Allocating module
space
happens before handling eventual duplicates right now, before a
module even
is "alive" and in the MODULE_STATE_UNFORMED state.

The first two hunks are a revert of commit 6e6de3dee51a and I'm under
the impression that cauased your issues as *more* modules states are
allowed through.

The last hunk tries to fix what 6e6de3dee51a wanted to do.


Note that I don't think the issue I raised is due to 6e6de3dee51a.

But maybe I am missing something important.

Please do test if you can.

I don't have the machine at hand right now. But, again, I doubt this
will fix it.


The flow is in load_module():

    mod = layout_and_allocate(info, flags);
    if (IS_ERR(mod)) {
        ...
    }

    audit_log_kern_module(mod->name);

    /* Reserve our place in the list. */
    err = add_unformed_module(mod);
    if (err)
        goto free_module;


You can have 400 threads in layout_and_allocate() loading the same
module at the same time and running out of module space. Any changes to
add_unformed_module() and finished_loading() won't change that, because
they are not involved before the module space allocations happened.


I'd like to see a refreshed patch but I tested the latest version and
see that the boot time is LONGER with the change

Before:

[11:17 AM root@intel-eaglestream-spr-15 kernel-ark]# systemd-analyze
Startup finished in 55.418s (firmware) + 22.766s (loader) + 35.856s
(kernel) + 5.830s (initrd) + 15.671s (userspace) = 2min 15.542s
multi-user.target reached after 15.606s in userspace.

After:

Startup finished in 55.314s (firmware) + 23.033s (loader) + 35.331s
(kernel) + 5.176s (initrd) + 23.465s (userspace) = 2min 22.320s
multi-user.target reached after 23.093s in userspace.

Subsequent reboots also indicate that userspace boot time is longer
after the change.

Thanks for testing this patch, that is an interesting result.

I see the following dependency chain on my system (openSUSE Tumbleweed):
multi-user.target -> basic.target -> sysinit.target -> systemd-udev-trigger.service.

My understanding is that the udev trigger service only performs the trigger
operation but does not actually wait on all devices to be processed by udevd.
In other words, handling of the forced udev events can still be in progress
after multi-user.target is reached.

The current serialization of same-name module loads can result in many udev
workers sleeping in add_unformed_module() and hence creating at that point
less pressure on the CPU time from udevd. I wonder if this then maybe allows
other work needed to reach multi-user.target to proceed faster.

Could you please boot the machine with 'udev.log_level=debug' and provide me
logs ('journalctl -b -o short-monotonic') from a run with the vanilla kernel
and with the discussed patch?

Petr, I haven't missed your request. I'm waiting for the system to become free (I'm running a week long test on it). Hopefully I can get this data to you tomorrow AM.

My apologies for the wait,

P.


Thanks,
Petr