Re: [PATCH v2 2/2] module: Merge same-name module load requests

From: Petr Pavlu
Date: Sun Nov 13 2022 - 11:45:15 EST


On 10/24/22 16:00, Prarit Bhargava wrote:
> On 10/24/22 08:37, Petr Pavlu wrote:
>> On 10/18/22 21:53, Prarit Bhargava wrote:
>>> Quoting from the original thread,
>>>
>>>>
>>>> Motivation for this patch is to fix an issue observed on larger machines with
>>>> many CPUs where it can take a significant amount of time during boot to run
>>>> systemd-udev-trigger.service. An x86-64 system can have already intel_pstate
>>>> active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will
>>>> attempt to load these modules too. The operation will eventually fail in the
>>>> init function of a respective module where it gets recognized that another
>>>> cpufreq driver is already loaded and -EEXIST is returned. However, one uevent
>>>> is triggered for each CPU and so multiple loads of these modules will be
>>>> present. The current code then processes all such loads individually and
>>>> serializes them with the barrier in add_unformed_module().
>>>>
>>>
>>> The way to solve this is not in the module loading code, but in the udev
>>> code by adding a new event or in the userspace which handles the loading
>>> events.
>>>
>>> Option 1)
>>>
>>> Write/modify a udev rule to use a flock userspace file lock to
>>> prevent repeated loading. The problem with this is that it is still
>>> racy and still consumes CPU time repeatedly loading the ELF header and,
>>> depending on the system (i.e. a large number of CPUs), would still cause a
>>> boot delay. This would be better than what we have and is worth looking
>>> at as a simple solution. I'd like to see boot times with this change,
>>> and I'll try to come up with a measurement on a large CPU system.
>>
>> It is not immediately clear to me how this can be done as a udev rule. You
>> mention that you'll try to test this on a large CPU system. Does it mean that
>> you have a prototype implemented already? If yes, could you please share it?
>>
>
> Hi Petr,
>
> Sorry, I haven't had a chance to actually test this out but I see this
> problem with the acpi_cpufreq and other multiple-cpu drivers which load
> once per logical cpu. I was thinking of adding a udev rule like:
>
> ACTION!="add", GOTO="acpi_cpufreq_end"
>
> # I may have to add CPU modaliases here to get this to work correctly
> ENV{MODALIAS}=="acpi:ACPI0007:", GOTO="acpi_cpufreq_start"
>
> GOTO="acpi_cpufreq_start"
> GOTO="acpi_cpufreq_end"
>
> LABEL="acpi_cpufreq_start"
>
> ENV{DELAY_MODALIAS}="$env{MODALIAS}"
> ENV{MODALIAS}=""
> PROGRAM="/bin/sh -c 'flock -n /tmp/delay_acpi_cpufreq sleep 2'",
> RESULT=="", RUN{builtin}+="kmod load $env{DELAY_MODALIAS}"
>
> LABEL="acpi_cpufreq_end"

Thanks, that is an interesting idea. I think though the artificial delay that
it introduces would not be good (if I'm reading the snippet correctly).
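
For reference, the "first caller wins" part of the rule does not depend on
the sleep. Below is a minimal userspace sketch of what `flock -n` amounts to,
assuming Linux flock(2) semantics. The helper name try_lock_once() and the
lock path used with it are illustrative only and are not part of udev or kmod.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to take an exclusive, non-blocking flock(2) on
 * `path`. Returns the open fd (>= 0) if this caller won the lock, or -1 if
 * another open file description already holds it or an error occurred
 * (errno is set). The lock is dropped when the returned fd is closed.
 */
static int try_lock_once(const char *path)
{
	int fd;

	assert(path != NULL);

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
		int saved = errno;	/* EWOULDBLOCK if the lock is held */

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

A second try_lock_once() on the same path fails immediately with EWOULDBLOCK
while the first fd is still open, so the 2 second sleep in the rule only
determines how long later events keep being dropped.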

>>> Option 2)
>>>
>>> Create a new udev action, "add_once" to indicate to userspace that the
>>> module only needs to be loaded one time, and to ignore further load
>>> requests. This is a bit tricky as both kernel space and userspace would
>>> have to be modified. The udev rule would end up looking very similar to
>>> what we have now.
>>>
>>> The benefit of option 2 is that driver writers themselves can choose
>>> which drivers should issue "add_once" instead of add. Drivers that are
>>> known to run on all devices at once would call "add_once" to only issue
>>> a single load.
>>
>> On the device event side, I more wonder if it would be possible to avoid tying
>> up cpufreq and edac modules to individual CPU devices. Maybe their loading
>> could be attached to some platform device, even if it means introducing an
>> auxiliary device for this purpose? I need to look a bit more into this idea.
>
> That's an interesting idea and something I had not considered. Creating
> a virtual device and loading against that device would be a much better
> (easier?) solution.

I made some progress on this. Both acpi-cpufreq and pcc-cpufreq drivers have
their platform firmware interface defined by ACPI. Allowed performance states
and parameters must be the same for each CPU. Instead of matching these drivers on
acpi:ACPI0007: or acpi:LNXCPU: (per-CPU devices), it is possible to check the
ACPI namespace early and create a virtual platform device for each of these
interfaces if it is available. My test patch is pasted at the end of the
email.

This appears to work well and has the benefit that no attempt is made to load
pcc-cpufreq on machines where PCC is not available, which should be most
systems. I think this change is useful independently of whether there will be
eventually any improvement on the udev or module loader side. My plan is to
send it for review to the cpufreq maintainers.
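
To illustrate why the alias change reduces the number of load requests, here
is a small userspace sketch that counts how many device modaliases a module
alias matches. It approximates kmod's alias lookup with fnmatch(3), which is
an assumption made for illustration. The pattern "acpi*:ACPI0007:*" is what I
understand MODULE_DEVICE_TABLE(acpi, ...) to generate for this HID, and
count_alias_matches() is a made-up helper name.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/*
 * Count how many device modalias strings are matched by one module alias
 * pattern. Each match corresponds to one uevent that would request the
 * module.
 */
static size_t count_alias_matches(const char *alias,
				  const char *const *modaliases, size_t n)
{
	size_t hits = 0;

	assert(alias != NULL);

	for (size_t i = 0; i < n; i++)
		if (fnmatch(alias, modaliases[i], 0) == 0)
			hits++;
	return hits;
}
```

With one "acpi:ACPI0007:" modalias per CPU plus a single
"platform:acpi-cpufreq" device, the old ACPI alias matches once per CPU while
the new platform alias matches exactly once.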

There still remains a problem though that a CPU is typically aliased by other
drivers too:

# modprobe --show-depends 'cpu:type:x86,ven0000fam0006mod0055:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000B,000C,000D,000E,000F,0010,0011,0013,0015,0016,0017,0018,0019,001A,001B,001C,001D,001F,002B,0034,003A,003B,003D,0068,006A,006B,006C,006D,006F,0070,0072,0074,0075,0076,0078,0079,007C,0080,0081,0082,0083,0084,0085,0086,0087,0088,0089,008B,008C,008D,008E,008F,0091,0092,0093,0094,0095,0096,0097,0098,0099,009A,009B,009C,009D,009E,00C0,00C5,00C8,00E1,00E3,00E4,00E6,00E7,00EA,00F0,00F1,00F2,00F3,00F5,00F9,00FA,00FB,00FE,00FF,0100,0101,0102,0103,0104,0111,0120,0121,0123,0125,0126,0127,0128,0129,012A,012C,012D,012E,012F,0130,0131,0132,0133,0134,0137,0138,0139,013C,013E,013F,0140,0141,0142,0143,0160,0161,0162,0163,0164,0165,0171,01C0,01C1,01C2,01C4,01C5,01C6,0203,0204,020B,024A,025A,025B,025C,025D,025F'
insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/cryptd.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/crypto_simd.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/aesni-intel.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/sha512-ssse3.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/sha512-ssse3.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/sha512-ssse3.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/cryptd.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/ghash-clmulni-intel.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/gf128mul.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/polyval-generic.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/polyval-clmulni.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/crc32c-intel.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/crc32-pclmul.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/crct10dif-pclmul.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/virt/lib/irqbypass.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/kvm/kvm.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/kvm/kvm-intel.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/hwmon/coretemp.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/thermal/intel/intel_powerclamp.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/thermal/intel/x86_pkg_temp_thermal.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/nvdimm/libnvdimm.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/acpi/nfit/nfit.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/edac/skx_edac.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/platform/x86/intel/uncore-frequency/intel-uncore-frequency-common.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/platform/x86/intel/uncore-frequency/intel-uncore-frequency.ko
insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/powercap/intel_rapl_common.ko

If any of these modules repeatedly fails to load then this will again delay
processing of 'udevadm trigger' during boot when a given system has many CPUs.
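
As a rough model of the cost, the number of wasted loads scales with the CPU
count. The sketch below is only arithmetic: the struct, the function name and
any module names or failure flags passed to it are hypothetical and do not
claim that a particular module from the list above fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one module aliased by the per-CPU modalias. */
struct aliased_module {
	const char *name;
	bool init_fails;	/* init keeps returning an error */
};

/*
 * With one uevent per CPU, every module whose init keeps failing is
 * loaded and rejected once per CPU.
 */
static size_t repeated_failed_loads(const struct aliased_module *mods,
				    size_t nmods, size_t ncpus)
{
	size_t failing = 0;

	assert(mods != NULL || nmods == 0);

	for (size_t i = 0; i < nmods; i++)
		if (mods[i].init_fails)
			failing++;
	return failing * ncpus;
}
```

For example, two persistently failing modules on a 192-CPU machine would give
384 load attempts that all end in an error.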

Cheers,
Petr


diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index 0002eecbf870..b6fd14b829be 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -57,6 +57,7 @@ acpi-y += evged.o
acpi-y += sysfs.o
acpi-y += property.o
acpi-$(CONFIG_X86) += acpi_cmos_rtc.o
+acpi-$(CONFIG_X86) += acpi_cpufreq.o
acpi-$(CONFIG_X86) += x86/apple.o
acpi-$(CONFIG_X86) += x86/utils.o
acpi-$(CONFIG_X86) += x86/s2idle.o
diff --git a/drivers/acpi/acpi_cpufreq.c b/drivers/acpi/acpi_cpufreq.c
new file mode 100644
index 000000000000..3eebe58fbe9b
--- /dev/null
+++ b/drivers/acpi/acpi_cpufreq.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * ACPI support for Processor Performance Control and Processor Clocking
+ * Control.
+ */
+
+#include <linux/acpi.h>
+#include <linux/platform_device.h>
+
+#include "internal.h"
+
+static void cpufreq_add_device(struct acpi_device *adev, const char *name)
+{
+	struct platform_device *pdev;
+
+	pdev = platform_device_register_resndata(
+		&adev->dev, name, PLATFORM_DEVID_NONE, NULL, 0, NULL, 0);
+	if (IS_ERR(pdev))
+		dev_err(&adev->dev, "%s platform device creation failed: %ld\n",
+			name, PTR_ERR(pdev));
+}
+
+static acpi_status
+acpi_pct_match(acpi_handle handle, u32 level, void *context,
+	       void **return_value)
+{
+	bool *pct = context;
+
+	/* Check if the first CPU has _PCT. The data must be same for all. */
+	*pct = acpi_has_method(handle, "_PCT");
+	return AE_CTRL_TERMINATE;
+}
+
+void __init acpi_cpufreq_init(void)
+{
+	acpi_status status;
+	acpi_handle handle;
+	struct acpi_device *device;
+	bool pct = false;
+
+	status = acpi_get_handle(NULL, "\\_SB", &handle);
+	if (ACPI_FAILURE(status))
+		return;
+
+	device = acpi_fetch_acpi_dev(handle);
+	if (device == NULL)
+		return;
+
+	acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT,
+			    ACPI_UINT32_MAX, acpi_pct_match, NULL, &pct, NULL);
+	if (pct)
+		cpufreq_add_device(device, "acpi-cpufreq");
+
+	if (acpi_has_method(handle, "PCCH"))
+		cpufreq_add_device(device, "pcc-cpufreq");
+}
diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h
index 219c02df9a08..ab02efeaa192 100644
--- a/drivers/acpi/internal.h
+++ b/drivers/acpi/internal.h
@@ -55,8 +55,10 @@ static inline void acpi_dock_add(struct acpi_device *adev) {}
#endif
#ifdef CONFIG_X86
void acpi_cmos_rtc_init(void);
+void acpi_cpufreq_init(void);
#else
static inline void acpi_cmos_rtc_init(void) {}
+static inline void acpi_cpufreq_init(void) {}
#endif
int acpi_rev_override_setup(char *str);

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index b47e93a24a9a..e51cf28fc629 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -2614,6 +2614,7 @@ void __init acpi_scan_init(void)
if (!acpi_gbl_reduced_hardware)
acpi_bus_scan_fixed();

+ acpi_cpufreq_init();
acpi_turn_off_unused_power_resources();

acpi_scan_initialized = true;
diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
index 1bb2b90ebb21..1273d42e9ca1 100644
--- a/drivers/cpufreq/acpi-cpufreq.c
+++ b/drivers/cpufreq/acpi-cpufreq.c
@@ -1056,18 +1056,5 @@ MODULE_PARM_DESC(acpi_pstate_strict,
late_initcall(acpi_cpufreq_init);
module_exit(acpi_cpufreq_exit);

-static const struct x86_cpu_id __maybe_unused acpi_cpufreq_ids[] = {
- X86_MATCH_FEATURE(X86_FEATURE_ACPI, NULL),
- X86_MATCH_FEATURE(X86_FEATURE_HW_PSTATE, NULL),
- {}
-};
-MODULE_DEVICE_TABLE(x86cpu, acpi_cpufreq_ids);
-
-static const struct acpi_device_id __maybe_unused processor_device_ids[] = {
- {ACPI_PROCESSOR_OBJECT_HID, },
- {ACPI_PROCESSOR_DEVICE_HID, },
- {},
-};
-MODULE_DEVICE_TABLE(acpi, processor_device_ids);
-
MODULE_ALIAS("acpi");
+MODULE_ALIAS("platform:acpi-cpufreq");
diff --git a/drivers/cpufreq/pcc-cpufreq.c b/drivers/cpufreq/pcc-cpufreq.c
index 9f3fc7a073d0..cc898bc3e156 100644
--- a/drivers/cpufreq/pcc-cpufreq.c
+++ b/drivers/cpufreq/pcc-cpufreq.c
@@ -616,12 +616,7 @@ static void __exit pcc_cpufreq_exit(void)
free_percpu(pcc_cpu_info);
}

-static const struct acpi_device_id __maybe_unused processor_device_ids[] = {
- {ACPI_PROCESSOR_OBJECT_HID, },
- {ACPI_PROCESSOR_DEVICE_HID, },
- {},
-};
-MODULE_DEVICE_TABLE(acpi, processor_device_ids);
+MODULE_ALIAS("platform:pcc-cpufreq");

MODULE_AUTHOR("Matthew Garrett, Naga Chumbalkar");
MODULE_VERSION(PCC_VERSION);