Re: [PATCH] timers/nohz: introduce nohz_full_aggressive

From: Steven Rostedt
Date: Sun May 07 2023 - 10:09:06 EST



[ Added Anna-Maria who is doing some timer work as well ]

On Sun, 7 May 2023 11:07:00 +0200
Andrea Righi <andrea.righi@xxxxxxxxxxxxx> wrote:

> Overview:
>
> nohz_full is a feature that allows to reduce the number of CPU tick
> interrupts, thereby improving energy efficiency and reducing kernel
> jitter.

Hmm, I never thought of NOHZ_FULL used for energy efficiency, as the
CPU is still running user space code, and there's really nothing
inherently more power consuming with the tick.

>
> This works by stopping the tick interrupts on the CPUs that are either
> idle or that have only one runnable task on them (there is no reason to
> periodically interrupt the execution of a single running task if none
> else is waiting to acquire the same CPU).
>
> It is not possible to configure all the available CPUs to work in the
> nohz_full mode, at least one non-adaptive-tick CPU must be periodically
> interrupted to properly handle timekeeping tasks in the system (such as
> the gettimeofday() syscall returning accurate values).

Do we really need nohz_full for this? Instead, I think you want to look
at what Anna-Maria is doing with moving the timer "manager" around to
make sure that the tick stays on busy CPUs.

Again, nohz_full is not for power consumption savings, but instead to
reduce kernel interruption in user space.

>
> However, under certain conditions, we may want to relax this constraint,
> accepting potential time inaccuracies in the system, in order to provide
> additional benefits in terms of power consumption, performance and/or
> reduce kernel jitter even more.
>
> For this reason introduce the new parameter nohz_full_aggressive.
>
> This option enforces nohz_full across all the CPUs (even the
> timekeeping CPU) at the cost of potential timer inaccuracies in
> the system.
>
> Test:
>
> - Hardware: Dell XPS 13 7390 w/ 8 cores
>
> - Kernel is using CONFIG_HZ=1000 (worst case scenario in terms of
> power consumption and kernel jitter) and nohz_full=all
>
> - Measure interrupts and power consumption when the system is idle and
> with 1, 2, 4 and 8 CPU hogs
>
> Result:
>
> The following numbers have been collected using turbostat and dstat
> measuring the average over a 5min run for each test.
>
> irqs/sec                    idle   1 CPU hog  2 CPU hogs  4 CPU hogs  8 CPU hogs
> --------------------------------------------------------------------------------
> nohz_full               1036.679    1047.522    1046.203    1048.590    1074.867
> nohz_full_aggressive      98.685     106.296     127.587     146.586    1062.277
>
> Power (Watt)                idle   1 CPU hog  2 CPU hogs  4 CPU hogs  8 CPU hogs
> --------------------------------------------------------------------------------
> nohz_full                0.502 W     3.436 W     3.755 W     6.187 W     6.019 W
> nohz_full_aggressive     0.301 W     2.372 W     2.372 W     6.005 W     6.016 W
>
> % power reduction         40.04%      30.97%      36.83%       2.94%       0.05%
>

Nice.
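
For anyone who wants to redo the arithmetic, the "% power reduction" row
in the quoted table is just (baseline - aggressive) / baseline. Below is
a minimal userspace sketch, not part of the patch; reduction_pct() and
the array names are made up for illustration, and the readings are copied
from the table above. The same helper applied to the idle irqs/sec column
(1036.679 vs 98.685) gives a drop of roughly 90%.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Relative saving of `aggressive` versus `baseline`, in percent. */
static double reduction_pct(double baseline, double aggressive)
{
	assert(baseline > 0.0);
	return (baseline - aggressive) / baseline * 100.0;
}

/* Power readings (W) from the quoted table: idle, 1, 2, 4, 8 CPU hogs. */
static const double nohz_full_w[]       = { 0.502, 3.436, 3.755, 6.187, 6.019 };
static const double nohz_aggressive_w[] = { 0.301, 2.372, 2.372, 6.005, 6.016 };

/* Print one saving per load level, in table order. */
static void print_power_savings(void)
{
	size_t n = sizeof(nohz_full_w) / sizeof(nohz_full_w[0]);

	for (size_t i = 0; i < n; i++)
		printf("%.2f%%\n",
		       reduction_pct(nohz_full_w[i], nohz_aggressive_w[i]));
}
```

Calling print_power_savings() from a small main() prints one percentage
per load level.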

Now I doubt this is acceptable considering the side effects that the
timer inaccuracy can cause. I think this breaks some basic assumptions
in both the kernel and user space.

Now, I think what is really happening here is that you are somewhat
simulating the results that Anna-Maria has indirectly. That is, you
just prevent an idle CPU from waking up to handle interrupts when not
needed.
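
To make that concrete, the condition the quoted patch changes in
tick_nohz_next_event() can be modelled in userspace. This is only a
sketch: may_sleep_unbounded() and its parameters are names invented for
illustration, and TICK_DO_TIMER_NONE is assumed to be -1 here. With the
aggressive flag set, the CPU that owns the timekeeping duty is no longer
forced to wake up within timekeeping_max_deferment().

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel sentinel; the value is an assumption here. */
#define TICK_DO_TIMER_NONE (-1)

/*
 * Userspace model of the patched test in tick_nohz_next_event().
 * Returns true when the CPU may sleep with no timekeeping deadline
 * (delta = KTIME_MAX in the patch), false when it must wake up within
 * timekeeping_max_deferment().
 */
static bool may_sleep_unbounded(bool aggressive, int cpu, int do_timer_cpu,
				bool do_timer_last)
{
	assert(cpu >= 0);

	return (aggressive || cpu != do_timer_cpu) &&
	       (do_timer_cpu != TICK_DO_TIMER_NONE || !do_timer_last);
}
```

Without the flag, may_sleep_unbounded(false, 0, 0, false) is false for
the timekeeping CPU; with the flag it becomes true, which is where the
potential timer inaccuracy comes from.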

Anna-Maria,

Do you have some patches that Andrea could test with?

Thanks,

-- Steve

> Conclusion:
>
> nohz_full_aggressive used together with nohz_full=all saves some energy
> when the system is idle or under low CPU usage (e.g., when less than
> half of the CPUs are used).
>
> Under high CPU load conditions power consumption is pretty much
> identical to nohz_full=all because the impact of the saved power/irqs on
> the timekeeping CPU doesn't contribute very much to the total energy
> consumption.
>
> However, enabling nohz_full_aggressive can lead to timing inaccuracies
> in the system, because periodic ticks can be disabled also on the
> timekeeping CPU.
>
> Note:
>
> I wrote this patch while I was stuck in the airport, because my flight
> was delayed and I was trying to optimize the battery usage of my laptop
> in more creative ways. Ultimately I ended up wasting a lot more energy
> to test this patch, but at least the long wait wasn't too boring.
>
> Signed-off-by: Andrea Righi <andrea.righi@xxxxxxxxxxxxx>
> ---
> .../ABI/testing/sysfs-devices-system-cpu | 12 ++++++++++++
> .../admin-guide/kernel-parameters.txt | 7 +++++++
> Documentation/timers/no_hz.rst | 5 +++++
> drivers/base/cpu.c | 19 +++++++++++++++++++
> include/linux/tick.h | 7 +++++++
> kernel/time/hrtimer.c | 7 ++++++-
> kernel/time/tick-sched.c | 16 +++++++++++++---
> 7 files changed, 69 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
> index f54867cadb0f..aa620e154d54 100644
> --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> @@ -679,6 +679,18 @@ Description:
> (RO) the list of CPUs that are in nohz_full mode.
> These CPUs are set by boot parameter "nohz_full=".
>
> +What: /sys/devices/system/cpu/nohz_full_aggressive
> +Date: Apr 2023
> +Contact: Linux kernel mailing list <linux-kernel@xxxxxxxxxxxxxxx>
> +Description:
> + (RW) enable/disable nohz_full also for the timekeeping CPU.
> +
> + WARNING: enabling this option can cause potential
> + high-resolution timer inaccuracies in the system.
> +
> + This option can be set by boot parameter
> + "nohz_full_aggressive".
> +
> What: /sys/devices/system/cpu/isolated
> Date: Apr 2015
> Contact: Linux kernel mailing list <linux-kernel@xxxxxxxxxxxxxxx>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 9e5bab29685f..23c6fe20e067 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3732,6 +3732,13 @@
> Note that this argument takes precedence over
> the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
>
> + nohz_full_aggressive
> + [KNL,BOOT,SMP,ISOL] allow to enable nohz_full also for
> + the timekeeping CPU.
> +
> + WARNING: enabling this option can cause potential
> + high-resolution timer inaccuracies in the system.
> +
> noinitrd [RAM] Tells the kernel not to load any configured
> initial RAM disk.
>
> diff --git a/Documentation/timers/no_hz.rst b/Documentation/timers/no_hz.rst
> index f8786be15183..aa9f79297d77 100644
> --- a/Documentation/timers/no_hz.rst
> +++ b/Documentation/timers/no_hz.rst
> @@ -136,6 +136,11 @@ error message, and the boot CPU will be removed from the mask. Note that
> this means that your system must have at least two CPUs in order for
> CONFIG_NO_HZ_FULL=y to do anything for you.
>
> +This constraint can be relaxed passing the parameter "nohz_full_aggressive".
> +With this option enabled the timekeeping CPU can also be configured to use
> +adaptive ticks, at the cost of potential high-resolution timer
> +inaccuracies in the system.
> +
> Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
> This is covered in the "RCU IMPLICATIONS" section below.
>
> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> index c1815b9dae68..b55d6111a733 100644
> --- a/drivers/base/cpu.c
> +++ b/drivers/base/cpu.c
> @@ -280,6 +280,24 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
> return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(tick_nohz_full_mask));
> }
> static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
> +
> +static ssize_t
> +nohz_full_aggressive_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + return sysfs_emit(buf, "%d\n", tick_nohz_full_aggressive);
> +}
> +
> +static ssize_t nohz_full_aggressive_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + if (kstrtobool(buf, &tick_nohz_full_aggressive))
> + return -EINVAL;
> + return count;
> +}
> +
> +static DEVICE_ATTR_RW(nohz_full_aggressive);
> #endif
>
> static void cpu_device_release(struct device *dev)
> @@ -468,6 +486,7 @@ static struct attribute *cpu_root_attrs[] = {
> &dev_attr_isolated.attr,
> #ifdef CONFIG_NO_HZ_FULL
> &dev_attr_nohz_full.attr,
> + &dev_attr_nohz_full_aggressive.attr,
> #endif
> #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
> &dev_attr_modalias.attr,
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 9459fef5b857..8d557838b3f6 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -176,6 +176,7 @@ static inline void tick_nohz_idle_stop_tick_protected(void) { }
>
> #ifdef CONFIG_NO_HZ_FULL
> extern bool tick_nohz_full_running;
> +extern bool tick_nohz_full_aggressive;
> extern cpumask_var_t tick_nohz_full_mask;
>
> static inline bool tick_nohz_full_enabled(void)
> @@ -186,6 +187,11 @@ static inline bool tick_nohz_full_enabled(void)
> return tick_nohz_full_running;
> }
>
> +static inline bool tick_nohz_full_aggressive_enabled(void)
> +{
> + return !!tick_nohz_full_aggressive;
> +}
> +
> /*
> * Check if a CPU is part of the nohz_full subset. Arrange for evaluating
> * the cpu expression (typically smp_processor_id()) _after_ the static
> @@ -276,6 +282,7 @@ extern void __tick_nohz_task_switch(void);
> extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
> #else
> static inline bool tick_nohz_full_enabled(void) { return false; }
> +static inline bool tick_nohz_full_aggressive_enabled(void) { return false; }
> static inline bool tick_nohz_full_cpu(int cpu) { return false; }
> static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
>
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index e8c08292defc..b3f27c6c8475 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1866,7 +1866,12 @@ void hrtimer_interrupt(struct clock_event_device *dev)
> else
> expires_next = ktime_add(now, delta);
> tick_program_event(expires_next, 1);
> - pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
> + /*
> + * This is a "normal" condition when nohz_full_aggressive mode is
> + * enabled, so avoid printing this warning in this case.
> + */
> + if (!tick_nohz_full_aggressive_enabled())
> + pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
> }
>
> /* called with interrupts disabled */
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 52254679ec48..8864066e4746 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -188,7 +188,8 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
> */
> if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)) {
> #ifdef CONFIG_NO_HZ_FULL
> - WARN_ON_ONCE(tick_nohz_full_running);
> + if (!tick_nohz_full_aggressive_enabled())
> + WARN_ON_ONCE(tick_nohz_full_running);
> #endif
> tick_do_timer_cpu = cpu;
> }
> @@ -250,6 +251,8 @@ cpumask_var_t tick_nohz_full_mask;
> EXPORT_SYMBOL_GPL(tick_nohz_full_mask);
> bool tick_nohz_full_running;
> EXPORT_SYMBOL_GPL(tick_nohz_full_running);
> +bool tick_nohz_full_aggressive;
> +EXPORT_SYMBOL_GPL(tick_nohz_full_aggressive);
> static atomic_t tick_dep_mask;
>
> static bool check_tick_dependency(atomic_t *dep)
> @@ -524,6 +527,13 @@ void __tick_nohz_task_switch(void)
> }
> }
>
> +static int __init tick_nohz_full_aggressive_setup(char *str)
> +{
> + tick_nohz_full_aggressive = true;
> + return 1;
> +}
> +__setup("nohz_full_aggressive", tick_nohz_full_aggressive_setup);
> +
> /* Get the boot-time nohz CPU list from the kernel parameters. */
> void __init tick_nohz_full_setup(cpumask_var_t cpumask)
> {
> @@ -854,7 +864,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> * Otherwise we can sleep as long as we want.
> */
> delta = timekeeping_max_deferment();
> - if (cpu != tick_do_timer_cpu &&
> + if ((tick_nohz_full_aggressive_enabled() || cpu != tick_do_timer_cpu) &&
> (tick_do_timer_cpu != TICK_DO_TIMER_NONE || !ts->do_timer_last))
> delta = KTIME_MAX;
>
> @@ -1073,7 +1083,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
> if (unlikely(report_idle_softirq()))
> return false;
>
> - if (tick_nohz_full_enabled()) {
> + if (tick_nohz_full_enabled() && !tick_nohz_full_aggressive_enabled()) {
> /*
> * Keep the tick alive to guarantee timekeeping progression
> * if there are full dynticks CPUs around