Miniature NUMA scheduler

From: Martin J. Bligh (mbligh@aracnet.com)
Date: Thu Jan 09 2003 - 18:54:08 EST


I tried a small experiment today - a simple restriction of the
O(1) scheduler so it only balances inside a node. Coupled with
the small initial load balancing patch floating around, this
covers 95% of cases, is a trivial change (3 lines), performs
just as well as Erich's patch on a kernel compile, and actually
does better on schedbench.

This is NOT meant to be a replacement for the code Erich wrote;
it's meant to be a simple way to get integration and acceptance.
Code that just forks and never execs will stay on one node - but
we can take the code Erich wrote and put it in a separate rebalancer
that fires much less often to do a cross-node rebalance. All of that
would be under #ifdef CONFIG_NUMA; the only thing that would touch
mainline is these three lines of change, and it's trivial to see
they're completely equivalent to the current code on non-NUMA systems.
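
To illustrate the split (this is just a toy userspace model with
made-up node loads and invented helper names, not Erich's code or
mine): the existing O(1) balancer keeps running often but only within
the node, and a separate pass looks across nodes much less frequently:

/*
 * Toy userspace model of the two-level balancing idea above.
 * NR_NODES, node_load[], find_busiest_node() and the interval are
 * invented for illustration - this is not the kernel's (or Erich's) code.
 */
#include <stdio.h>

#define NR_NODES		4
#define CROSS_NODE_INTERVAL	10	/* cross-node pass fires 10x less often */

static int node_load[NR_NODES] = { 12, 3, 5, 4 };	/* made-up loads */

/* pick the most loaded remote node, but only if it's busier than us */
static int find_busiest_node(int this_node)
{
	int i, busiest = -1, max_load = node_load[this_node];

	for (i = 0; i < NR_NODES; i++) {
		if (i == this_node)
			continue;
		if (node_load[i] > max_load) {
			max_load = node_load[i];
			busiest = i;
		}
	}
	return busiest;
}

int main(void)
{
	int tick, busiest, this_node = 1;

	for (tick = 1; tick <= 30; tick++) {
		/* every tick: the normal O(1) balance, restricted to this
		 * node's CPUs (that restriction is the 3-line patch below) */
		if (tick % CROSS_NODE_INTERVAL)
			continue;
		/* every Nth tick only: the slower cross-node rebalance */
		busiest = find_busiest_node(this_node);
		if (busiest >= 0)
			printf("tick %d: pull work from node %d to node %d\n",
			       tick, busiest, this_node);
	}
	return 0;
}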

I also believe this is the more correct design approach: it should
result in much less cross-node migration of tasks, and less scanning
of remote runqueues.

Opinions / comments?

M.

Kernbench:
                                 Elapsed      User    System       CPU
                   2.5.54-mjb3    19.41s   186.38s   39.624s   1191.4%
          2.5.54-mjb3-mjbsched   19.508s  186.356s   39.888s   1164.6%

Schedbench 4:
                                 AvgUser   Elapsed TotalUser  TotalSys
                   2.5.54-mjb3      0.00     35.14     88.82      0.64
          2.5.54-mjb3-mjbsched      0.00     31.84     88.91      0.49

Schedbench 8:
                                 AvgUser   Elapsed TotalUser  TotalSys
                   2.5.54-mjb3      0.00     47.55    269.36      1.48
          2.5.54-mjb3-mjbsched      0.00     41.01    252.34      1.07

Schedbench 16:
                                 AvgUser   Elapsed TotalUser  TotalSys
                   2.5.54-mjb3      0.00     76.53    957.48      4.17
          2.5.54-mjb3-mjbsched      0.00     69.01    792.71      2.74

Schedbench 32:
                                 AvgUser   Elapsed TotalUser  TotalSys
                   2.5.54-mjb3      0.00    145.20   1993.97     11.05
          2.5.54-mjb3-mjbsched      0.00    117.47   1798.93      5.95

Schedbench 64:
                                 AvgUser   Elapsed TotalUser  TotalSys
                   2.5.54-mjb3      0.00    307.80   4643.55     20.36
          2.5.54-mjb3-mjbsched      0.00    241.04   3589.55     12.74

-----------------------------------------

diff -purN -X /home/mbligh/.diff.exclude virgin/kernel/sched.c mjbsched/kernel/sched.c
--- virgin/kernel/sched.c Mon Dec 9 18:46:15 2002
+++ mjbsched/kernel/sched.c Thu Jan 9 14:09:17 2003
@@ -654,7 +654,7 @@ static inline unsigned int double_lock_b
 /*
  * find_busiest_queue - find the busiest runqueue.
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)
 {
         int nr_running, load, max_load, i;
         runqueue_t *busiest, *rq_src;
@@ -689,7 +689,7 @@ static inline runqueue_t *find_busiest_q
         busiest = NULL;
         max_load = 1;
         for (i = 0; i < NR_CPUS; i++) {
-                if (!cpu_online(i))
+                if (!cpu_online(i) || !((1 << i) & cpumask) )
                         continue;
 
                 rq_src = cpu_rq(i);
@@ -764,7 +764,8 @@ static void load_balance(runqueue_t *thi
         struct list_head *head, *curr;
         task_t *tmp;
 
-        busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+        busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
+                        __node_to_cpu_mask(__cpu_to_node(this_cpu)) );
         if (!busiest)
                 goto out;
 

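For reference, here is what the extra cpumask argument does on a
hypothetical box with 4 CPUs per node - a standalone userspace model,
where cpu_to_node() and node_to_cpu_mask() are stand-ins for the real
topology macros, not their actual implementations:

/*
 * Userspace model of the cpumask restriction in find_busiest_queue()
 * above. The 16-CPU / 4-per-node layout and the two helpers are
 * assumptions for the example, not the real __cpu_to_node() /
 * __node_to_cpu_mask().
 */
#include <stdio.h>

#define NR_CPUS		16
#define CPUS_PER_NODE	4

static int cpu_to_node(int cpu)
{
	return cpu / CPUS_PER_NODE;
}

static unsigned long node_to_cpu_mask(int node)
{
	return ((1UL << CPUS_PER_NODE) - 1) << (node * CPUS_PER_NODE);
}

int main(void)
{
	int i, this_cpu = 5;	/* CPU 5 lives on node 1 */
	unsigned long cpumask = node_to_cpu_mask(cpu_to_node(this_cpu));

	/* the same filter the patched find_busiest_queue() applies */
	for (i = 0; i < NR_CPUS; i++) {
		if (!((1 << i) & cpumask))
			continue;
		printf("CPU %2d is a candidate runqueue for CPU %d\n",
		       i, this_cpu);
	}
	return 0;
}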
---------------------------------------------------

A tiny change in the current ilb patch is also needed to stop it
using a macro from the first patch:

diff -purN -X /home/mbligh/.diff.exclude ilbold/kernel/sched.c ilbnew/kernel/sched.c
--- ilbold/kernel/sched.c Thu Jan 9 15:20:53 2003
+++ ilbnew/kernel/sched.c Thu Jan 9 15:27:49 2003
@@ -2213,6 +2213,7 @@ static void sched_migrate_task(task_t *p
 static int sched_best_cpu(struct task_struct *p)
 {
         int i, minload, load, best_cpu, node = 0;
+        unsigned long cpumask;
 
         best_cpu = task_cpu(p);
         if (cpu_rq(best_cpu)->nr_running <= 2)
@@ -2226,9 +2227,11 @@ static int sched_best_cpu(struct task_st
                         node = i;
                 }
         }
+
         minload = 10000000;
-        loop_over_node(i,node) {
-                if (!cpu_online(i))
+        cpumask = __node_to_cpu_mask(node);
+        for (i = 0; i < NR_CPUS; ++i) {
+                if (!(cpumask & (1 << i)))
                         continue;
                 if (cpu_rq(i)->nr_running < minload) {
                         best_cpu = i;
