[patch 2.6.0-test1] node affine NUMA scheduler extension

From: Erich Focht (efocht@hpce.nec.com)
Date: Fri Jul 18 2003 - 11:29:43 EST

No real change compared to the previous version, patch was only
adapted to fit into 2.6.0-test1. I append the description from my
previous posting.

The patch shows 5-8% gain in the numa_test benchmark on a TX7 Itanium2
machine with 8 CPUs/4 nodes. The interesting numbers are ElapsedTime
and TotalUserTime. In numa_test I changed the PROBLEMSIZE from 1000000
to 2000000 in order to get longer execution/test times. The results
are averages over 10 measurements; the standard deviation is given in
parentheses.

2.6.0-test1 kernel: original NUMA scheduler

Tasks  AverageUserTime  ElapsedTime   TotalUserTime   TotalSysTime
    4  52.67(3.51)      61.30(8.04)   210.70(14.05)   0.16(0.02)
    8  50.29(1.85)      55.19(6.36)   402.38(14.78)   0.34(0.02)
   16  53.27(2.30)      115.30(5.40)  852.40(36.75)   0.62(0.02)
   32  51.92(1.13)      215.98(5.95)  1661.66(36.08)  1.21(0.04)

2.6.0-test1 kernel: node affine NUMA scheduler

Tasks  AverageUserTime  ElapsedTime   TotalUserTime   TotalSysTime
    4  50.13(2.09)      56.72(8.46)   200.55(8.34)    0.15(0.01)
    8  49.78(1.29)      54.43(4.90)   398.26(10.31)   0.34(0.02)
   16  50.37(0.96)      110.79(8.46)  806.01(15.33)   0.63(0.03)
   32  51.10(0.51)      210.18(3.27)  1635.40(16.16)  1.23(0.04)
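
The relative gain for any row can be recomputed from the two tables
above. A minimal sketch in C; the helper name rel_gain_percent is made
up for illustration and is not part of numa_test or of the patch:

```c
/*
 * Hypothetical helper (not from numa_test or the patch): relative
 * improvement in percent of the node affine scheduler over the
 * original NUMA scheduler. Both arguments are times in seconds, so a
 * positive result means the node affine scheduler was faster.
 */
static double rel_gain_percent(double original, double node_affine)
{
    return 100.0 * (original - node_affine) / original;
}
```

For example, rel_gain_percent(61.30, 56.72) for the ElapsedTime of the
4-task row evaluates to roughly 7.5. Given the standard deviations in
the tables, a single row should be read with some caution.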

In order to see the UserTime per CPU, one needs an additional patch that
restores the per-CPU times in /proc/pid/cpu. That patch comes in a
separate post.

> This patch is an adaptation of the earlier work on the node affine
> NUMA scheduler to the NUMA features meanwhile integrated into
> 2.5. Compared to the patch posted for 2.5.39 this one is much simpler
> and easier to understand.
> The main idea is (still) that tasks are assigned a homenode to which
> they are preferentially scheduled. They are not only sticking as much
> as possible to a node (as in the current 2.5 NUMA scheduler) but will
> also be attracted back to their homenode if they had to be scheduled
> away. Therefore the tasks can be called "affine" to the homenode.
> The implementation is straightforward:
> - Tasks have an additional element in their task structure (node).
> - The scheduler keeps track of the homenodes of the tasks running in
>   each node and on each runqueue.
> - At cross-node load balance time, nodes/runqueues which run tasks
>   originating from the stealer node are preferred. They get a weight
>   bonus for each task with the homenode of the stealer.
> - When stealing from a remote node, one first tries to get the
>   stealer's own tasks (if any), then tasks homed on other nodes (if
>   any). This way tasks are kept on their homenode as long as possible.
> The selection of the homenode is currently done at initial load
> balancing, i.e. at exec(). A smarter selection method might be needed
> for improving the situation for multithreaded processes. An option is
> the dynamic_homenode patch I posted for 2.5.39 or some other scheme
> based on an RSS/node measure. But that's another story...
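
The two balancing rules from the description above can be sketched in
plain user-space C. This is only an illustration under stated
assumptions, not the code from the patch: the structures, the function
names, NR_NODES, the bonus of 1 per task, and the final fallback to
tasks homed on the victim node are all hypothetical.

```c
#define NR_NODES       4  /* assumed: 4 nodes, as on the test machine */
#define HOMENODE_BONUS 1  /* assumed weight bonus per task homed on the stealer */

/* Hypothetical per-node bookkeeping, not the kernel's data structures. */
struct node_load {
    int nr_running;             /* runnable tasks currently on this node */
    int nr_homenode[NR_NODES];  /* of those, how many are homed on each node */
};

struct task {
    int pid;
    int homenode;
};

/*
 * Pick the node to steal from. Only nodes busier than the stealer are
 * considered; each one gets a bonus for every task it runs whose
 * homenode is the stealer. Returns -1 if no node qualifies.
 */
static int find_busiest_node(const struct node_load *nodes, int this_node)
{
    int best = -1, best_weight = 0;

    for (int n = 0; n < NR_NODES; n++) {
        if (n == this_node || nodes[n].nr_running <= nodes[this_node].nr_running)
            continue;
        int weight = nodes[n].nr_running
                   + HOMENODE_BONUS * nodes[n].nr_homenode[this_node];
        if (weight > best_weight) {
            best_weight = weight;
            best = n;
        }
    }
    return best;
}

/*
 * Pick the index of the task to steal from a remote queue: first a task
 * homed on the stealer, then one homed on some third node, and only
 * then (an assumption of this sketch) a task that is at home on the
 * victim node. Returns -1 for an empty queue.
 */
static int pick_task_to_steal(const struct task *queue, int len,
                              int this_node, int victim_node)
{
    int other = -1, any = -1;

    for (int i = 0; i < len; i++) {
        if (queue[i].homenode == this_node)
            return i;
        if (other < 0 && queue[i].homenode != victim_node)
            other = i;
        if (any < 0)
            any = i;
    }
    return other >= 0 ? other : any;
}
```

With two equally loaded remote nodes, the one running tasks homed on
the stealer wins the comparison, and those tasks are the first to be
pulled back.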


