Re: [RFC] Extend mwait idle to optimize away IPIs when possible

From: Venki Pallipadi
Date: Thu Feb 09 2012 - 21:17:11 EST


On Wed, Feb 8, 2012 at 6:18 PM, Yong Zhang <yong.zhang0@xxxxxxxxx> wrote:
> On Wed, Feb 08, 2012 at 03:28:45PM -0800, Venki Pallipadi wrote:
>> On Tue, Feb 7, 2012 at 10:51 PM, Yong Zhang <yong.zhang0@xxxxxxxxx> wrote:
>> > On Mon, Feb 06, 2012 at 12:42:13PM -0800, Venkatesh Pallipadi wrote:
>> >> smp_call_function_single and ttwu_queue_remote send an unconditional IPI
>> >> to the target CPU. However, if the target CPU is in mwait-based idle, we
>> >> can do IPI-less wakeups using the magical powers of monitor-mwait.
>> >> Doing this has certain advantages:
>> >
>> > Actually I'm trying to do a similar thing on MIPS.
>> >
>> > The difference is that I want task_is_polling() to do something. The basic
>> > idea is:
>> >
>> >> +			if (ipi_pending()) {
>> >> +				clear_ipi_pending();
>> >> +				local_bh_disable();
>> >> +				local_irq_disable();
>> >> +				generic_smp_call_function_single_interrupt();
>> >> +				scheduler_wakeup_self_check();
>> >> +				local_irq_enable();
>> >> +				local_bh_enable();
>> >
>> > I let cpu_idle() check if there is anything to do, as in your code above.
>> >
>> > And task_is_polling() handles the others, with the patch below:
>> > ---
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 5255c9d..09f633d 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -527,15 +527,16 @@ void resched_task(struct task_struct *p)
>> >  		smp_send_reschedule(cpu);
>> >  }
>> >
>> > -void resched_cpu(int cpu)
>> > +int resched_cpu(int cpu)
>> >  {
>> >  	struct rq *rq = cpu_rq(cpu);
>> >  	unsigned long flags;
>> >
>> >  	if (!raw_spin_trylock_irqsave(&rq->lock, flags))
>> > -		return;
>> > +		return 0;
>> >  	resched_task(cpu_curr(cpu));
>> >  	raw_spin_unlock_irqrestore(&rq->lock, flags);
>> > +	return 1;
>> >  }
>>
>
> I assume we are talking about 'return from idle' but it seems I didn't
> make that clear.
>
>> Two points -
>> rq->lock: I tried something similar first. One hurdle with checking
>> task_is_polling() is that you need rq->lock to check it. And adding
>> lock+unlock (without wait) in the wakeup path ended up being no net
>> gain compared to the IPI. And when we actually end up spinning on
>> that lock, that's going to add overhead in the common path. That is
>> the reason I switched to atomic compare exchange and moved any wait
>> onto the target CPU coming out of idle.
>
> I see. But actually we will not spin on that lock because we
> use 'trylock' in resched_cpu().

Ahh. Sorry I missed the trylock in there...

> And you are right that there is indeed a
> little overhead (resched_task()) if we hold the lock, but it can be
> tolerated IMHO.

One advantage I got from using atomics instead of rq->lock was, as I
mentioned in the patch description, handling the case where 2 CPUs try
to send an IPI to the same target CPU at around the same time (a
50-100 us window if the CPU is in a deep C-state on x86).
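
To make that concrete, here is a minimal sketch of the kind of cmpxchg
handshake I mean (the state names and helpers here are illustrative
only, not the actual symbols from the RFC patches):

#include <linux/atomic.h>
#include <linux/percpu.h>

enum ipi_less_state { NOT_POLLING, IDLE_POLLING, IPI_PENDING };

static DEFINE_PER_CPU(atomic_t, ipi_less_state);

/*
 * Sender side: flip IDLE_POLLING -> IPI_PENDING with a single atomic
 * op.  Only the first of several concurrent senders wins the exchange;
 * the rest see IPI_PENDING and can skip their IPI, or see NOT_POLLING
 * and fall back to a real IPI.  Nobody touches rq->lock.  This assumes
 * the idle loop MONITORs this per-cpu word before MWAIT, so the
 * winning store is what ends the mwait.
 */
static int try_ipi_less_wakeup(int cpu)
{
	atomic_t *state = &per_cpu(ipi_less_state, cpu);

	return atomic_cmpxchg(state, IDLE_POLLING, IPI_PENDING)
						== IDLE_POLLING;
}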

>
> BTW, mind showing your test case so that we can collect some common data?

The test case was a silly clock measurement around
__smp_call_function_single() with the optimization I had in
generic_exec_single(). Attaching the patch I had below.

>>
>> resched_task: ttwu_queue_remote() does not imply that the remote CPU
>> will do a resched. Today there is an IPI, and the IPI handler calls
>> into check_preempt_wakeup(); if the current task has higher precedence
>> than the waking task, then there will be just an activation of the new
>> task and no resched. Using resched_task above breaks
>> check_preempt_wakeup() and always forces a resched on the remote CPU
>> after the IPI, which would be a change in behavior.
>
> Yeah, if the remote cpu is not idle, mine will change the behavior; but
> if the remote cpu is idle, it will always be rescheduled, right?
>
> So maybe we could introduce resched_idle_cpu() to make things more clear:
>
> int resched_idle_cpu(int cpu)
> {
>        struct rq *rq = cpu_rq(cpu);
>        unsigned long flags;
>        int ret = 0;
>
>        if (!raw_spin_trylock_irqsave(&rq->lock, flags))
>                goto out;
>        if (!idle_cpu(cpu))
>                goto out_unlock;
>        resched_task(cpu_curr(cpu));
>        ret = 1;
> out_unlock:
>        raw_spin_unlock_irqrestore(&rq->lock, flags);
> out:
>        return ret;
> }
>

This should likely work. But if you want to use similar logic for
smp_call_function() or the idle load balance kick etc., you need an
additional bit other than need_resched(), as there we only need the
irq+softirq work and not necessarily a resched.
At this time I am not sure how the poll wakeup logic works on MIPS.
But if it is something similar to x86 mwait and we can wake up with a
bit other than TIF_NEED_RESCHED, we can generalize most of the changes
in my RFC and share them across archs.
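
Just to sketch what I mean by the generic part (bit names are made up
here, purely illustrative):

#include <linux/percpu.h>
#include <linux/smp.h>

#define IDLE_WAKEUP_RESCHED	0	/* wants a full resched */
#define IDLE_WAKEUP_FUNC_CALL	1	/* only irq/softirq work queued */

static DEFINE_PER_CPU(unsigned long, idle_wakeup_pending);

/*
 * Runs on the target CPU as it comes out of mwait/poll idle.  A
 * pending smp_call_function is drained directly, without forcing a
 * resched the way resched_task() would.
 */
static void idle_wakeup_check(void)
{
	unsigned long *pending = &__get_cpu_var(idle_wakeup_pending);

	if (test_and_clear_bit(IDLE_WAKEUP_FUNC_CALL, pending)) {
		local_irq_disable();
		generic_smp_call_function_single_interrupt();
		local_irq_enable();
	}
	/* IDLE_WAKEUP_RESCHED takes the normal need_resched() path. */
}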

-Venki

>>
>> >
>> >  #ifdef CONFIG_NO_HZ
>> > @@ -1484,7 +1485,8 @@ void scheduler_ipi(void)
>> >
>> >  static void ttwu_queue_remote(struct task_struct *p, int cpu)
>> >  {
>> > -	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
>> > +	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list) &&
>> > +	    !resched_cpu(cpu))
>> >  		smp_send_reschedule(cpu);
>> >  }
>> >
>> > Thoughts?
>> >
>> > Thanks,
>> > Yong
>
> --
> Only stand for myself
From fd0f349bffdf61fda8a8085b435ec40d9ddfba33 Mon Sep 17 00:00:00 2001
From: Venkatesh Pallipadi <venki@xxxxxxxxxx>
Date: Wed, 1 Feb 2012 17:02:59 -0800
Subject: [PATCH 3/5] test: ipicost test routine

Silly test to measure ipicost.
$ taskset 0x1 cat /proc/ipicost; dmesg | tail -$(($(grep processor /proc/cpuinfo | wc -l)*2))

Signed-off-by: Venkatesh Pallipadi <venki@xxxxxxxxxx>
---
 fs/proc/Makefile  |    2 +-
 fs/proc/ipicost.c |  107 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+), 1 deletions(-)
 create mode 100644 fs/proc/ipicost.c

diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index c1c7293..4407c6f 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -8,7 +8,7 @@ proc-y := nommu.o task_nommu.o
 proc-$(CONFIG_MMU)	:= mmu.o task_mmu.o

 proc-y	+= inode.o root.o base.o generic.o array.o \
-		proc_tty.o
+		proc_tty.o ipicost.o
 proc-y	+= cmdline.o
 proc-y	+= consoles.o
 proc-y	+= cpuinfo.o
diff --git a/fs/proc/ipicost.c b/fs/proc/ipicost.c
new file mode 100644
index 0000000..967201d
--- /dev/null
+++ b/fs/proc/ipicost.c
@@ -0,0 +1,107 @@
+#include <linux/fs.h>
+#include <linux/smp.h>
+#include <linux/init.h>
+#include <linux/delay.h>
+#include <linux/timer.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+
+#define REP_COUNT 100
+
+static int dummy_count;
+static u64 recv_sum;
+
+static void dummy_ipi(void *intime)
+{
+	u64 start, curr;
+	start = (u64)intime;
+	rdtscll(curr);
+	recv_sum += (curr - start);
+	dummy_count++;
+}
+
+static int show_ipicost(struct seq_file *m, void *v)
+{
+	int i;
+	int count;
+	struct call_single_data csd;
+
+	csd.flags = 0;
+	csd.func = &dummy_ipi;
+	csd.info = NULL;
+
+	for_each_online_cpu(i) {
+		u64 start, stop, sum;
+
+		sum = 0;
+		recv_sum = 0;
+		dummy_count = 0;
+		for (count = 0; count < REP_COUNT; count++) {
+			rdtscll(start);
+			csd.info = (void *)start;
+			__smp_call_function_single(i, &csd, 0);
+			rdtscll(stop);
+			sum += (stop - start);
+			msleep(1);
+		}
+		printk("0 CPU %d, time %Lu, recv %Lu, count %d\n", i, sum / REP_COUNT, recv_sum / REP_COUNT, dummy_count);
+	}
+
+	for_each_online_cpu(i) {
+		u64 start, stop, sum;
+
+		sum = 0;
+		recv_sum = 0;
+		dummy_count = 0;
+		for (count = 0; count < REP_COUNT; count++) {
+			rdtscll(start);
+			csd.info = (void *)start;
+			__smp_call_function_single(i, &csd, 1);
+			rdtscll(stop);
+			sum += (stop - start);
+			msleep(1);
+		}
+		printk("1 CPU %d, time %Lu, recv %Lu, count %d\n", i, sum / REP_COUNT, recv_sum / REP_COUNT, dummy_count);
+	}
+	return 0;
+}
+
+static void *c_start(struct seq_file *m, loff_t *pos)
+{
+	return (void *)1;
+}
+
+static void *c_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	return NULL;
+}
+
+static void c_stop(struct seq_file *m, void *v)
+{
+}
+
+static const struct seq_operations ipicost_op = {
+	.start = c_start,
+	.next = c_next,
+	.stop = c_stop,
+	.show = show_ipicost,
+};
+
+static int ipicost_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &ipicost_op);
+}
+
+static const struct file_operations proc_ipicost_operations = {
+	.open = ipicost_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static int __init proc_ipicost_init(void)
+{
+	proc_create("ipicost", 0, NULL, &proc_ipicost_operations);
+	return 0;
+}
+module_init(proc_ipicost_init);
--
1.7.7.3