[PATCH] 2.3.46 can_schedule macro

From: bhartner@us.ibm.com
Date: Mon Feb 28 2000 - 09:13:33 EST


Here is a patch that fixes the can_schedule macro for SMP.

[Dimitris Michailidis also identified this problem in a previous post]

Problem:

(1) On SMP, if a SCHED_OTHER process yields the processor
(sys_sched_yield) and there are no other processes on the
run queue, then priorities are recalculated.

(2) On SMP, if a SCHED_OTHER process yields the processor
(sys_sched_yield) and there are one or more lower-priority
processes on the run queue, then a lower-priority process
is dispatched and schedule_tail will migrate the yielding
process to an idle processor or preempt the process that was
just dispatched.

(3) The side effect of always recalculating priorities is that
it reduces priority preemptions and timeslice preemptions.
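
For reference, the selection path in schedule() that produces behavior (1)
looks roughly like the sketch below (abridged and paraphrased from the 2.3.x
kernel/sched.c, so treat it as an illustration rather than exact source).
A SCHED_OTHER task that has just yielded contributes a goodness of 0, and
because the unpatched can_schedule() tests only has_cpu, the yielding task
(whose has_cpu flag is still set while schedule() runs) is skipped in the
run queue scan; if nothing else is runnable, c stays 0 and the scheduler
drops into the recalculation path.

	/* sketch of the 2.3.x selection path in schedule(), abridged */
	next = idle_task(this_cpu);
	c = -1000;
	if (prev->state == TASK_RUNNING) {
		/* a task that just yielded contributes a goodness of 0 */
		c = prev_goodness(prev, this_cpu, prev->active_mm);
		next = prev;
	}

	tmp = runqueue_head.next;
	while (tmp != &runqueue_head) {
		p = list_entry(tmp, struct task_struct, run_list);
		/* unpatched: can_schedule(p) is !p->has_cpu, so the yielding
		 * task (has_cpu still set) is never examined here */
		if (can_schedule(p)) {
			int weight = goodness(p, this_cpu, prev->active_mm);
			if (weight > c)
				c = weight, next = p;
		}
		tmp = tmp->next;
	}

	/* nothing beat a goodness of 0 -> recalculate all counters */
	if (!c)
		goto recalculate;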

The problem can be demonstrated using the Scheduler Statistics
(sstat) patch available at

http://oss.software.ibm.com/developerworks/opensource/linux/patches/sstat/sstat_1.0_2.3.46.tar.gz

and a simple test program.

See the HTML document included in the tar file for details on the sstat
patch.

[the patch collects scheduler statistics and makes them available via
/proc/sstat]

I applied the sstat patch to 2.3.46 and wrote a simple test case
to yield the processor 1,000,000 times.

#include <unistd.h>

int main(void)
{
        int i = 1000000;

        do {
                syscall(158);   /* 158 = __NR_sched_yield on i386 */
                i--;
        } while (i);

        return 0;
}

The following runs the workload and collects the stats on a 2-way
200 MHz Netfinity 7000.

cat /proc/sstat > x1.dat # reset the stats
millionyields # run workload
cat /proc/sstat > x2.dat # get stats

Here is the output from x2.dat.

I have inserted comments using [].

                        CPU0        CPU1       TOTAL
cycles run         542133338   178635147   720768485
cycles sched      1900382931   627011090  2527394021

[3x more CPU cycles spent in the scheduler than running the yielding process]

cycles runque     1851569401   610902622  2462472023
PC0 00070045         4543704     1063285     5606989
PC1 000700C0       174067032    58485233   232552265

P1 -> P2                   7           6          13
P1 -> P1              751922      248067      999989
P1 -> I                    9          24          33
I -> P1                    9          24          33
I -> I                     5           6          11
switch_mm()               14          17          31
nr_running            752092      248184     1000276
recalculate:          751922      248068      999990

[the total number of recalculates is about 1,000,000]

sche.._tail()              5           6          11
wake_up_p..()             23          12          35
LAST/IDLE/CUR              4           9          13
CFT 000451EB               1           1           2
LAST/IDLE/IPI              9           2          11
----/IDLE/CUR              3           1           4
----/IDLE/IPI             11           5          16
LAST/----/CUR              0           0           0
LAST/----/IPI              0           0           0
----/----/CUR              0           0           0
----/----/IPI              0           0           0
need_re.. set              0           0           0
tail pre                   5           5          10
tail pre cur               0           0           0

sys..yield()          751927      248073     1000000

[the total number of yields is 1,000,000]

time preempt               0           0           0

[even though the workload takes 16+ seconds to run, no timeslice
preemptions occur, because priorities are constantly being recalculated;
a sketch of the recalculation loop follows this listing]

resched ipis               7          20          27
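
For reference, the recalculation path that gets taken on nearly every one of
those 1,000,000 yields looks roughly like this in the 2.3.x tree (again a
paraphrased sketch, not a verbatim excerpt). Besides walking every task in
the system under the tasklist lock, each pass tops the counters back up
toward 2*priority, which is why the yielding task's time slice never expires
and "time preempt" stays at 0:

recalculate:
	{
		struct task_struct *p;
		spin_unlock_irq(&runqueue_lock);
		read_lock(&tasklist_lock);
		for_each_task(p)
			p->counter = (p->counter >> 1) + p->priority;
		read_unlock(&tasklist_lock);
		spin_lock_irq(&runqueue_lock);
	}
	goto repeat_schedule;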

----------------------

Applying the patch and re-running the workload yields:

                        CPU0        CPU1       TOTAL
cycles run         416492148   293901330   710393478
cycles sched       191775913   135847777   327623690

[2x more time spent running the yielding process than in the scheduler]

cycles runque      153829470   108749664   262579134
PC0 00070045         2964234     1287855     4252089
PC1 000700C0       135837089    97143172   232980261

P1 -> P2                   5           5          10
P1 -> P1              583139      416872     1000011
P1 -> I                    8           7          15
I -> P1                    8           7          15
I -> I                     5           5          10
switch_mm()               11           9          20
nr_running            583206      416990     1000196
recalculate:              15          11          26

[the number of recalculates is in line with the number of
timeslice preemptions below]

sche.._tail()              4           4           8
wake_up_p..()              6          11          17
LAST/IDLE/CUR              1           4           5
CFT 000451EB               0           0           0
LAST/IDLE/IPI              3           3           6
----/IDLE/CUR              1           0           1
----/IDLE/IPI              5           8          13
LAST/----/CUR              0           0           0
LAST/----/IPI              0           0           0
----/----/CUR              0           0           0
----/----/IPI              0           0           0
need_re.. set              0           0           0
tail pre                   4           4           8
tail pre cur               0           0           0

sys..yield()          583131      416869     1000000

[the total number of yields is still 1,000,000]

time preempt              15          11          26

[the workload now takes under 6 seconds, and the number of timeslice
preemptions is in line with the length of the test]

resched ipis              11           8          19

--------------------- the patch ------------------------------

For the common case, the patch adds to the run queue search a branch that
would not normally be taken and a load from memory that is most likely in
the L1 cache.

--- 2.3.47/kernel/sched.c Sun Feb 20 22:26:10 2000
+++ 2.3.47.canschedule/kernel/sched.c Sat Feb 26 16:28:10 2000
@@ -83,12 +83,12 @@
 #ifdef __SMP__

 #define idle_task(cpu) (init_tasks[cpu_number_map(cpu)])
-#define can_schedule(p) (!(p)->has_cpu)
+#define can_schedule(p,current_cpu) (!(p)->has_cpu || ((p)->processor == (current_cpu)))

 #else

 #define idle_task(cpu) (&init_task)
-#define can_schedule(p) (1)
+#define can_schedule(p,current_cpu) (1)

 #endif

@@ -502,7 +502,7 @@
 	tmp = runqueue_head.next;
 	while (tmp != &runqueue_head) {
 		p = list_entry(tmp, struct task_struct, run_list);
-		if (can_schedule(p)) {
+		if (can_schedule(p,this_cpu)) {
 			int weight = goodness(p, this_cpu, prev->active_mm);
 			if (weight > c)
 				c = weight, next = p;
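
To make the effect of the macro change concrete, here is a small stand-alone
user-space demonstration (a toy model for illustration only; the struct and
the OLD_/NEW_ macro names are made up for this example and are not kernel
code):

#include <stdio.h>

/* toy stand-ins for the unpatched and patched macros */
#define OLD_can_schedule(p)      (!(p)->has_cpu)
#define NEW_can_schedule(p, cpu) (!(p)->has_cpu || ((p)->processor == (cpu)))

struct task { int has_cpu; int processor; };

int main(void)
{
        int this_cpu = 0;
        /* the yielding task still has has_cpu set while schedule() runs */
        struct task prev = { 1, 0 };

        printf("old macro considers yielder: %d\n",
               OLD_can_schedule(&prev));
        printf("new macro considers yielder: %d\n",
               NEW_can_schedule(&prev, this_cpu));
        /* prints 0 and 1: only the patched macro lets the run queue scan
         * pick the yielding task again on the CPU it is already on */
        return 0;
}

With the old test the yielding task is invisible to the scan on its own CPU;
with the new test it is considered again, so an uncontended sched_yield()
simply reschedules the caller instead of forcing a recalculation.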

Bill Hartner
IBM Linux Technology Center
bhartner@us.ibm.com
