Re: NUMA scheduler 2nd approach

From: Michael Hohnbaum (hohnbaum@us.ibm.com)
Date: Mon Jan 13 2003 - 20:23:56 EST


On Sun, 2003-01-12 at 15:55, Erich Focht wrote:
> Hi Martin & Michael,
>
> as discussed on the LSE call I played around with a cross-node
> balancer approach put on top of the miniature NUMA scheduler. The
> patches are attached and it seems to be clear that we can regain the
> good performance for hackbench by adding a cross-node balancer.

Erich,

I played with this today on my 4-node (16 CPU) NUMAQ. Spent most
of the time working with the first three patches. What I found was
that rebalancing was happening too often between nodes. I tried a
few things to change this, but have not yet settled on the best
approach. A key item to work with is the check in find_busiest_node
that determines whether the found node is busier enough to warrant
stealing from it. Currently the check is that the node has 125% of
the load of the current node. I think that, for my system at least,
we need to add a constant to this equation. I tried using 4 and that
helped a little. Finally I added the 04 patch, and that helped
a lot. Still, there is too much process movement between nodes.

Tomorrow, I will continue experiments, but base them on the first
4 patches. Two suggestions for minor changes:

* Make the check in find_busiest_node into a macro that is defined
  in the arch specific topology header file. Then different NUMA
  architectures can tune this appropriately.

* In find_busiest_queue change:

        cpumask |= __node_to_cpu_mask(node);
  to:
        cpumask = __node_to_cpu_mask(node) | (1UL << (this_cpu));
        

  There is no reason to iterate over the runqueues on the current
  node, which is what the code currently does.

Some numbers for anyone interested.

Kernbench:

All numbers are based on a 2.5.55 kernel with the cputime stats patch:
  * stock55 = no additional patches
  * mini+rebal-55 = patches 01, 02, and 03
  * rebal+4+fix = patches 01, 02, 03, and the cpumask change described
    above, and a +4 constant added to the check in find_busiest_node
  * rebal+4+fix+04 = above with the 04 patch added

                       Elapsed      User   System      CPU
  rebal+4+fix+04-55    29.302s  285.136s  82.106s    1253%
     rebal+4+fix-55    30.498s  286.586s  88.176s  1228.6%
      mini+rebal-55    30.756s  287.646s  85.512s  1212.8%
            stock55    31.018s  303.084s  86.626s  1256.2%

Schedbench 4:
                      AvgUser  Elapsed  TotalUser  TotalSys
  rebal+4+fix+04-55     27.34    40.49     109.39      0.88
     rebal+4+fix-55     24.73    38.50      98.94      0.84
      mini+rebal-55     25.18    43.23     100.76      0.68
            stock55     31.38    41.55     125.54      1.24

Schedbench 8:
                      AvgUser  Elapsed  TotalUser  TotalSys
  rebal+4+fix+04-55     30.05    44.15     240.48      2.50
     rebal+4+fix-55     34.33    46.40     274.73      2.31
      mini+rebal-55     32.99    52.42     264.00      2.08
            stock55     44.63    61.28     357.11      2.22

Schedbench 16:
                      AvgUser  Elapsed  TotalUser  TotalSys
  rebal+4+fix+04-55     52.13    57.68     834.23      3.55
     rebal+4+fix-55     52.72    65.16     843.70      4.55
      mini+rebal-55     57.29    71.51     916.84      5.10
            stock55     66.91    85.08    1070.72      6.05

Schedbench 32:
                      AvgUser  Elapsed  TotalUser  TotalSys
  rebal+4+fix+04-55     56.38   124.09    1804.67      7.71
     rebal+4+fix-55     55.13   115.18    1764.46      8.86
      mini+rebal-55     57.83   125.80    1850.84     10.19
            stock55     80.38   181.80    2572.70     13.22

Schedbench 64:
                      AvgUser  Elapsed  TotalUser  TotalSys
  rebal+4+fix+04-55     57.42   238.32    3675.77     17.68
     rebal+4+fix-55     60.06   252.96    3844.62     18.88
      mini+rebal-55     58.15   245.30    3722.38     19.64
            stock55    123.96   513.66    7934.07     26.39

And here are the results from running numa_test 32 on rebal+4+fix+04:

Executing 32 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 8.383
Job  node00  node01  node02  node03  | iSched MSched | UserTime(s)
  1   100.0     0.0     0.0     0.0  |    0      0   |    56.19
  2   100.0     0.0     0.0     0.0  |    0      0   |    53.80
  3     0.0     0.0   100.0     0.0  |    2      2   |    55.61
  4   100.0     0.0     0.0     0.0  |    0      0   |    54.13
  5     3.7     0.0     0.0    96.3  |    3      3   |    56.48
  6     0.0     0.0   100.0     0.0  |    2      2   |    55.11
  7     0.0     0.0   100.0     0.0  |    2      2   |    55.94
  8     0.0     0.0   100.0     0.0  |    2      2   |    55.69
  9    80.6    19.4     0.0     0.0  |    1      0  *|    56.53
 10     0.0     0.0     0.0   100.0  |    3      3   |    53.00
 11     0.0    99.2     0.0     0.8  |    1      1   |    56.72
 12     0.0     0.0     0.0   100.0  |    3      3   |    54.58
 13     0.0   100.0     0.0     0.0  |    1      1   |    59.38
 14     0.0    55.6     0.0    44.4  |    3      1  *|    63.06
 15     0.0   100.0     0.0     0.0  |    1      1   |    56.02
 16     0.0    19.2     0.0    80.8  |    1      3  *|    58.07
 17     0.0   100.0     0.0     0.0  |    1      1   |    53.78
 18     0.0     0.0   100.0     0.0  |    2      2   |    55.28
 19     0.0    78.6     0.0    21.4  |    3      1  *|    63.20
 20     0.0   100.0     0.0     0.0  |    1      1   |    53.27
 21     0.0     0.0   100.0     0.0  |    2      2   |    55.79
 22     0.0     0.0     0.0   100.0  |    3      3   |    57.23
 23    12.4    19.1     0.0    68.5  |    1      3  *|    61.05
 24     0.0     0.0   100.0     0.0  |    2      2   |    54.50
 25     0.0     0.0     0.0   100.0  |    3      3   |    56.82
 26     0.0     0.0   100.0     0.0  |    2      2   |    56.28
 27    15.3     0.0     0.0    84.7  |    3      3   |    57.12
 28   100.0     0.0     0.0     0.0  |    0      0   |    53.85
 29    32.7    67.2     0.0     0.0  |    0      1  *|    62.66
 30   100.0     0.0     0.0     0.0  |    0      0   |    53.86
 31   100.0     0.0     0.0     0.0  |    0      0   |    53.94
 32   100.0     0.0     0.0     0.0  |    0      0   |    55.36
AverageUserTime 56.38 seconds
ElapsedTime 124.09
TotalUserTime 1804.67
TotalSysTime 7.71

Ideally, there would be nothing but 100.0 in all non-zero entries.
I'll try adding in the 05 patch, and if that does not help, will
try a few other adjustments.

Thanks for the quick effort on putting together the node rebalance
code. I'll also get some hackbench numbers soon.

-- 

Michael Hohnbaum 503-578-5486 hohnbaum@us.ibm.com T/L 775-5486



