Re: acpi_idle: Very idle Core i7 machine never enters C3

From: Jeff Garrett
Date: Fri Feb 05 2010 - 15:54:17 EST


On Fri, Feb 05, 2010 at 12:45:21PM -0500, Len Brown wrote:
> Jeff,
> What do you see if you apply just the patch below?
>
> Also, in addition to "powertop -d" to show what the kernel requests,
> please run turbostat to show what the hardware actually did:
>
> http://userweb.kernel.org/~lenb/acpi/utils/pmtools-latest/turbostat/turbostat.c
>
> eg.
> # turbostat -d -v sleep 5
>
> thanks,
> -Len Brown, Intel Open Source Technology Center

With the patch, powertop reports good C3 residency and wakeups
remain very low. Seems to work. :)

I attached the powertop & turbostat output with this patch.

However, this confuses me. In a previous experiment in the
acpi_processor_setup_cpuidle() function, I replaced the pointer
to acpi_idle_enter_bm() with a pointer to
acpi_idle_enter_simple() even when bm_check is nonzero. With
that, I was able to get into C3, but the wakeups ballooned. But
the difference between what I did, and what you did, is the
difference between acpi_idle_enter_bm() with acpi_idle_bm_check()
returning zero and acpi_idle_enter_simple(). Those code paths
look almost identical. The bm path calls acpi_unlazy_tlb(), and
doesn't appear to call the ACPI_FLUSH_CPU_CACHE(), and they call
sched_clock_idle_sleep_event() in different places. I don't
understand why any of these differences would have had any
significant effect on wakeups.

I'm left wondering if it's a problem on my part. I should repeat
that previous experiment and see if there really is something
significantly different there.

BTW, getting a bit off topic, but since the two code paths are
almost identical, is there any reason not to unite them?
Something like the attached patch might work?

Thanks,
Jeff
PowerTOP 1.12 (C) 2007, 2008 Intel Corporation

Collecting data for 15 seconds


Cn Avg residency
C0 (cpu running) ( 2.7%)
polling 0.0ms ( 0.0%)
C1 mwait 0.0ms ( 0.0%)
C2 mwait 0.0ms ( 0.0%)
C3 mwait 151.8ms (97.3%)
P-states (frequencies)
Turbo Mode 0.0%
2.67 Ghz 0.0%
2.54 Ghz 0.0%
2.40 Ghz 0.0%
1.60 Ghz 100.0%
Wakeups-from-idle per second : 6.4 interval: 15.0s
no ACPI power usage estimate available
Top causes for wakeups:
38.4% ( 11.2) [extra timer interrupt]
29.7% ( 8.7) [kernel scheduler] Load balancing tick
11.2% ( 3.3) [kernel core] hrtimer_start (tick_sched_timer)
7.1% ( 2.1) [Rescheduling interrupts] <kernel IPI>
6.9% ( 2.0) [kernel core] add_timer_on (clocksource_watchdog)
1.8% ( 0.5) events/3
1.6% ( 0.5) [eth0] <interrupt>
0.7% ( 0.2) [kernel core] __mod_timer (dev_watchdog)
0.5% ( 0.1) [ata_piix, ata_piix] <interrupt>
0.2% ( 0.1) [kernel core] __mod_timer (tcp_delack_timer)
0.2% ( 0.1) sshd
0.2% ( 0.1) [kernel core] enqueue_task_rt (sched_rt_period_timer)
0.2% ( 0.1) expr
0.2% ( 0.1) [kernel core] __mod_timer (cfq_idle_slice_timer)
0.2% ( 0.1) [kernel core] queue_delayed_work (delayed_work_timer_fn)
0.2% ( 0.1) [kernel core] __mod_timer (peer_check_expire)
0.2% ( 0.1) [kernel core] __mod_timer (neigh_timer_handler)
0.2% ( 0.1) [kernel core] __mod_timer (sync_supers_timer_fn)

Suggestion: Enable the CONFIG_USB_SUSPEND kernel configuration option.
This option will automatically disable UHCI USB when not in use, and may
save approximately 1 Watt of power.

Recent USB suspend statistics
Active Device name

Recent audio activity statistics
Active Device name

Recent SATA AHCI link activity statistics
Active Partial Slumber Device name
turbostat Jan-27, 2010 - Len Brown <lenb@xxxxxxxxxx>
num_cpus 8
CPUID GenuineIntel 11 levels family:model:stepping 6:26:4
Nehalem multiplier 20, TSC frequency 2667 MHz
MSR_NEHALEM_PLATFORM_INFO: 0xc0000001400
Nehalem 4 cores active: 21 mult, max turbo frequency = 2800 MHz
Nehalem 3 cores active: 21 mult, max turbo frequency = 2800 MHz
Nehalem 2 cores active: 21 mult, max turbo frequency = 2800 MHz
Nehalem 1 core active: 22 mult, max turbo frequency = 2933 MHz
5.002349 sec
CPU GHz TSC %c0 %c1 %c3 %c6 %pc3 %pc6 %pc7
0 1.60 2.66 0.02 0.07 0.00 99.92 0.00 0.00 0.00
1 1.60 2.66 0.01 0.02 0.00 99.97 0.00 0.00 0.00
2 1.60 2.66 0.02 0.04 0.00 99.94 0.00 0.00 0.00
3 1.60 2.66 0.09 0.13 0.00 99.78 0.00 0.00 0.00
4 1.60 2.66 0.05 0.04 0.00 99.91 0.00 0.00 0.00
5 1.60 2.66 0.01 0.02 0.00 99.97 0.00 0.00 0.00
6 1.60 2.66 0.01 0.05 0.00 99.94 0.00 0.00 0.00
7 1.59 2.66 0.05 0.16 0.00 99.79 0.00 0.00 0.00
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 7c0441f..8c636de 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -849,73 +851,6 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
return idle_time;
}

-/**
- * acpi_idle_enter_simple - enters an ACPI state without BM handling
- * @dev: the target CPU
- * @state: the state data
- */
-static int acpi_idle_enter_simple(struct cpuidle_device *dev,
- struct cpuidle_state *state)
-{
- struct acpi_processor *pr;
- struct acpi_processor_cx *cx = cpuidle_get_statedata(state);
- ktime_t kt1, kt2;
- s64 idle_time;
- s64 sleep_ticks = 0;
-
- pr = __get_cpu_var(processors);
-
- if (unlikely(!pr))
- return 0;
-
- if (acpi_idle_suspend)
- return(acpi_idle_enter_c1(dev, state));
-
- local_irq_disable();
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we test
- * NEED_RESCHED:
- */
- smp_mb();
-
- if (unlikely(need_resched())) {
- current_thread_info()->status |= TS_POLLING;
- local_irq_enable();
- return 0;
- }
-
- /*
- * Must be done before busmaster disable as we might need to
- * access HPET !
- */
- lapic_timer_state_broadcast(pr, cx, 1);
-
- if (cx->type == ACPI_STATE_C3)
- ACPI_FLUSH_CPU_CACHE();
-
- kt1 = ktime_get_real();
- /* Tell the scheduler that we are going deep-idle: */
- sched_clock_idle_sleep_event();
- acpi_idle_do_entry(cx);
- kt2 = ktime_get_real();
- idle_time = ktime_to_us(ktime_sub(kt2, kt1));
-
- sleep_ticks = us_to_pm_timer_ticks(idle_time);
-
- /* Tell the scheduler how much we idled: */
- sched_clock_idle_wakeup_event(sleep_ticks*PM_TIMER_TICK_NS);
-
- local_irq_enable();
- current_thread_info()->status |= TS_POLLING;
-
- cx->usage++;
-
- lapic_timer_state_broadcast(pr, cx, 0);
- cx->time += sleep_ticks;
- return idle_time;
-}
-
static int c3_cpu_count;
static DEFINE_SPINLOCK(c3_lock);

@@ -944,7 +879,7 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
if (acpi_idle_suspend)
return(acpi_idle_enter_c1(dev, state));

- if (acpi_idle_bm_check()) {
+ if (cx->type == ACPI_STATE_C3 && pr->flags.bm_check && acpi_idle_bm_check()) {
if (dev->safe_state) {
dev->last_state = dev->safe_state;
return dev->safe_state->enter(dev, dev->safe_state);
@@ -970,17 +905,24 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
return 0;
}

- acpi_unlazy_tlb(smp_processor_id());
+ if (cx->type == ACPI_STATE_C3 && pr->flags.bm_check)
+ acpi_unlazy_tlb(smp_processor_id());

/* Tell the scheduler that we are going deep-idle: */
- sched_clock_idle_sleep_event();
+ if (cx->type == ACPI_STATE_C3 && pr->flags.bm_check)
+ sched_clock_idle_sleep_event();
/*
* Must be done before busmaster disable as we might need to
* access HPET !
*/
lapic_timer_state_broadcast(pr, cx, 1);

+ if (cx->type == ACPI_STATE_C3 && !pr->flags.bm_check)
+ ACPI_FLUSH_CPU_CACHE();
+
kt1 = ktime_get_real();
+ if (cx->type != ACPI_STATE_C3 || !pr->flags.bm_check)
+ sched_clock_idle_sleep_event();
/*
* disable bus master
* bm_check implies we need ARB_DIS
@@ -991,21 +933,19 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
* not set. In that case we cannot do much, we enter C3
* without doing anything.
*/
- if (pr->flags.bm_check && pr->flags.bm_control) {
+ if (cx->type == ACPI_STATE_C3 && pr->flags.bm_check && pr->flags.bm_control) {
spin_lock(&c3_lock);
c3_cpu_count++;
/* Disable bus master arbitration when all CPUs are in C3 */
if (c3_cpu_count == num_online_cpus())
acpi_write_bit_register(ACPI_BITREG_ARB_DISABLE, 1);
spin_unlock(&c3_lock);
- } else if (!pr->flags.bm_check) {
- ACPI_FLUSH_CPU_CACHE();
}

acpi_idle_do_entry(cx);

/* Re-enable bus master arbitration */
- if (pr->flags.bm_check && pr->flags.bm_control) {
+ if (cx->type == ACPI_STATE_C3 && pr->flags.bm_check && pr->flags.bm_control) {
spin_lock(&c3_lock);
acpi_write_bit_register(ACPI_BITREG_ARB_DISABLE, 0);
c3_cpu_count--;
@@ -1095,7 +1035,7 @@ static int acpi_processor_setup_cpuidle(struct acpi_processor *pr)
case ACPI_STATE_C2:
state->flags |= CPUIDLE_FLAG_BALANCED;
state->flags |= CPUIDLE_FLAG_TIME_VALID;
- state->enter = acpi_idle_enter_simple;
+ state->enter = acpi_idle_enter_bm;
dev->safe_state = state;
break;

@@ -1103,9 +1043,7 @@ static int acpi_processor_setup_cpuidle(struct acpi_processor *pr)
state->flags |= CPUIDLE_FLAG_DEEP;
state->flags |= CPUIDLE_FLAG_TIME_VALID;
state->flags |= CPUIDLE_FLAG_CHECK_BM;
- state->enter = pr->flags.bm_check ?
- acpi_idle_enter_bm :
- acpi_idle_enter_simple;
+ state->enter = acpi_idle_enter_bm;
break;
}