Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path

From: Quan Xu
Date: Fri Nov 17 2017 - 06:24:10 EST




On 2017-11-16 17:53, Thomas Gleixner wrote:
On Thu, 16 Nov 2017, Quan Xu wrote:
On 2017-11-16 06:03, Thomas Gleixner wrote:
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ target_state = &drv->states[index];
ÂÂÂÂÂÂÂ }

+#ifdef CONFIG_PARAVIRT
+ÂÂÂÂÂÂ paravirt_idle_poll();
+
+ÂÂÂÂÂÂ if (need_resched())
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ return -EBUSY;
+#endif
That's just plain wrong. We don't want to see any of this PARAVIRT crap in
anything outside the architecture/hypervisor interfacing code which really
needs it.

The problem can and must be solved at the generic level in the first place
to gather the data which can be used to make such decisions.

How that information is used might be either completely generic or requires
system specific variants. But as long as we don't have any information at
all we cannot discuss that.

Please sit down and write up which data needs to be considered to make
decisions about probabilistic polling. Then we need to compare and contrast
that with the data which is necessary to make power/idle state decisions.

I would be very surprised if this data would not overlap by at least 90%.


Peter, tglx
Thanks for your comments..

rethink of this patch set,

1. which data needs to considerd to make decisions about probabilistic polling

I really need to write up which data needs to considerd to make
decisions about probabilistic polling. At last several months,
I always focused on the data _from idle to reschedule_, then to bypass
the idle loops. unfortunately, this makes me touch scheduler/idle/nohz
code inevitably.

with tglx's suggestion, the data which is necessary to make power/idle
state decisions, is the last idle state's residency time. IIUC this data
is duration from idle to wakeup, which maybe by reschedule irq or other irq.

I also test that the reschedule irq overlap by more than 90% (trace the
need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for
one minute.

as the overlap, I think I can input the last idle state's residency time
to make decisions about probabilistic polling, as @dev->last_residency does.
it is much easier to get data.


2. do a HV specific idle driver (function)

so far, power management is not exposed to guest.. idle is simple for KVM guest,
calling "sti" / "hlt"(cpuidle_idle_call() --> default_idle_call())..
thanks Xen guys, who has implemented the paravirt framework. I can implement it
as easy as following:

ÂÂÂÂÂÂÂÂÂÂÂÂ --- a/arch/x86/kernel/kvm.c
ÂÂÂÂÂÂÂÂÂÂÂÂ +++ b/arch/x86/kernel/kvm.c
ÂÂÂÂÂÂÂÂÂÂÂÂ @@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void)
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ update_intr_gate(X86_TRAP_PF, async_page_fault);
ÂÂÂÂÂÂÂÂÂÂÂÂÂ }

ÂÂÂÂÂÂÂÂÂÂÂÂ +static __cpuidle void kvm_safe_halt(void)
ÂÂÂÂÂÂÂÂÂÂÂÂ +{
ÂÂÂ ÂÂÂÂ +ÂÂÂÂÂÂÂ /* 1. POLL, if need_resched() --> return */
ÂÂÂ ÂÂÂÂ +
ÂÂÂÂÂÂÂÂÂÂÂÂ +ÂÂÂÂÂÂÂ asm volatile("sti; hlt": : :"memory"); /* 2. halt */
ÂÂÂÂÂÂÂÂÂÂÂÂ +
ÂÂÂ ÂÂÂÂ +ÂÂÂÂÂÂÂ /* 3. get the last idle state's residency time */
ÂÂÂÂÂÂÂÂÂÂÂÂ +
ÂÂÂ ÂÂÂÂ +ÂÂÂÂÂÂÂ /* 4. update poll duration based on last idle state's residency time */
ÂÂÂÂÂÂÂÂÂÂÂÂ +}
ÂÂÂÂÂÂÂÂÂÂÂÂ +
ÂÂÂÂÂÂÂÂÂÂÂÂÂ void __init kvm_guest_init(void)
ÂÂÂÂÂÂÂÂÂÂÂÂÂ {
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ int i;
ÂÂÂÂÂÂÂÂÂÂÂÂ @@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ if (kvmclock_vsyscall)
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ kvm_setup_vsyscall_timeinfo();

ÂÂÂÂÂÂÂÂÂÂÂÂ +ÂÂÂÂÂÂ pv_irq_ops.safe_halt = kvm_safe_halt;
ÂÂÂÂÂÂÂÂÂÂÂÂ +
ÂÂÂÂÂÂÂÂÂÂÂÂÂ #ifdef CONFIG_SMP




then, I am no need to introduce a new pvops, and never modify schedule/idle/nohz code again.
also I can narrow all of the code down in arch/x86/kernel/kvm.c.

If this is in the right direction, I will send a new patch set next week..

thanks,

Quan
Alibaba Cloud