Re: The tick is active on idle adaptive-tick CPUs when /dev/cpu_dma_latency is used

From: Rafael J. Wysocki
Date: Sat Jul 06 2019 - 07:06:43 EST


On Saturday, July 6, 2019 10:17:15 AM CEST Rafael J. Wysocki wrote:
> On Fri, Jul 5, 2019 at 7:22 PM Thomas Lindroth
> <thomas.lindroth@xxxxxxxxx> wrote:
> >
> > On recent kernels the tick remains active on idle adaptive-tick CPUs when a small
> > value is written to /dev/cpu_dma_latency to restrict the highest C-state. Before the
> > idle loop redesign in 4.17 idle CPUs had the tick disabled even when C-state were
> > restricted. Is this change intentional or a regression?
>
> It was intentional, but this kind of is a gray area.
>
> At least for the menu governor you may argue that the decision on
> whether or not to stop the tick should be based on the predicted idle
> duration.

But also see below.

> > I use an x86_64 system built with CONFIG_NO_HZ_FULL that I recently upgraded to the 4.19 series from the 4.14 series.
> > I noticed that adaptive-tick CPUs (nohz_full=1-7) still fire timer interrupts about 1000 times/s (CONFIG_HZ_1000=y) even
> > when they are mostly idle. Some debugging showed that this only happens when a program is writing to
> > /dev/cpu_dma_latency to restrict C-states. The old 4.14 kernel only have around 10 timer interrupts per second on idle
> > adaptive-tick CPU even when C-states are restricted that way.
> >
> > I would expect an adaptive-tick CPU to turn off the tick when it has 0 or 1 processes to run and enable the tick for >2
> > processes. Kernels after 4.17 instead have the tick on when 0 or >2 processes are running and the tick off in the 1 process
> > case. Since the tick is off when a single process is running that workload isn't directly harmed by the change but if the CPU
> > use hyperthreading the constant wakeups on an idle HT sibling will reduce performance on the other sibling.

So it looks like the idle loop needs a special case for adaptive-tick CPUs.

> >
> > They way I look for timer interrupts is by comparing the LOC line in /proc/interrupts or using the hrtimer_expire_entry
> > tracepoint when function=tick_sched_timer. Both methods seem to give the same results.
> >
> > I can reproduce the problem using an i7-4790K CPU with /sys/devices/system/cpu/cpuidle/current_driver:intel_idle. I can
> > also reproduce the problem on an old core2duo laptop with current_driver:acpi_idle but I can't reproduce the problem
> > in a virtual machine where current_driver:none. I also can't reproduce the problem if C-states are restricted using the
> > intel_idle.max_cstate=1 kernel argument instead of /dev/cpu_dma_latency.
> >
> > The commit that introduced the change is 554c8aa8ec "sched: idle: Select idle state before stopping the tick" in v4.17
> > and the problem exists at least up to kernel 5.1 using the menu cpuidle governor.
>
> Restoring the previous behavior in this case should be relatively
> straightforward. I'll send you a patch to do that later.

The patch is below, but note that it adds the tick stopping overhead to the idle loop
for CPUs that are not adaptive-tick and when PM QoS latency constraints are used
which is not desirable in general.

Please test it, but as I said above, the real solution appears to be to treat adaptive-tick
CPUs in a special way in the idle loop.

---
drivers/cpuidle/governors/menu.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -302,9 +302,10 @@ static int menu_select(struct cpuidle_dr
!drv->states[0].disabled && !dev->states_usage[0].disable)) {
/*
* In this case state[0] will be used no matter what, so return
- * it right away and keep the tick running.
+ * it right away and keep the tick running if state[0] is a
+ * polling one.
*/
- *stop_tick = false;
+ *stop_tick = !!(drv->states[0].flags & CPUIDLE_FLAG_POLLING);
return 0;
}

@@ -395,16 +396,9 @@ static int menu_select(struct cpuidle_dr

return idx;
}
- if (s->exit_latency > latency_req) {
- /*
- * If we break out of the loop for latency reasons, use
- * the target residency of the selected state as the
- * expected idle duration so that the tick is retained
- * as long as that target residency is low enough.
- */
- predicted_us = drv->states[idx].target_residency;
+ if (s->exit_latency > latency_req)
break;
- }
+
idx = i;
}