Re: SCHED_RR is broken + patch

From: Christian Ehrhardt (ehrhardt@mathematik.uni-ulm.de)
Date: Sat Apr 29 2000 - 07:33:10 EST


Hi,
attached is a new patch that should address most of the issues mentioned.
Comments are (of course) welcome.

How the patch works:
A second run queue for real time processes is introduced. Processes
with a real time scheduling policy are kept on this new run queue, while
SCHED_OTHER processes stay on the normal run queue.
The order of processes on the real time run queue is strictly honoured,
and the movement of real time processes on this queue is made Unix98
(Posix) compliant (I hope).
A process accessing the run queue must have interrupts disabled and hold
the run queue lock. This was already the case in practice, but the
comments in the scheduler suggested otherwise.
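
To illustrate the queue discipline, here is a minimal standalone sketch
(not the kernel code itself) of the sentinel-based doubly linked run
queue the patch uses for real time tasks. The names are illustrative;
the static sentinel plays the role the RT_QUEUE_HEAD macro plays in the
patch below:

#include <stdio.h>

struct task {
        int pid;
        struct task *next_run, *prev_run;
};

/* Sentinel: an empty queue points back at itself. */
static struct task rt_head = { 0, &rt_head, &rt_head };

static void add_rt_queue_last(struct task *p)
{
        p->prev_run = rt_head.prev_run;
        p->next_run = &rt_head;
        rt_head.prev_run->next_run = p;
        rt_head.prev_run = p;
}

static void add_rt_queue_first(struct task *p)
{
        p->next_run = rt_head.next_run;
        p->prev_run = &rt_head;
        rt_head.next_run->prev_run = p;
        rt_head.next_run = p;
}

static void del_from_queue(struct task *p)
{
        p->next_run->prev_run = p->prev_run;
        p->prev_run->next_run = p->next_run;
        p->next_run = p->prev_run = NULL;
}

int main(void)
{
        struct task a = { 1, NULL, NULL };
        struct task b = { 2, NULL, NULL };
        struct task c = { 3, NULL, NULL };
        struct task *p;

        add_rt_queue_last(&a);          /* queue: 1 */
        add_rt_queue_last(&b);          /* queue: 1 2 */
        add_rt_queue_first(&c);         /* queue: 3 1 2 */
        del_from_queue(&a);             /* queue: 3 2 */

        for (p = rt_head.next_run; p != &rt_head; p = p->next_run)
                printf("pid %d\n", p->pid);
        return 0;
}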

Performance:
The normal path should have no additional overhead if there are no real
time processes running. The only case where the scheduler should get
slower is when real time processes are runnable but all of them are
already running on other CPUs. Adding a task to one of the run queues
needs a few additional cycles.

A few comments on some of the mentioned issues:

On Wed, Apr 26, 2000 at 11:48:05PM +0200, Borislav Deianov wrote:
> On Wed, Apr 26, 2000 at 05:07:52AM -0700, George Anzinger wrote:
> > Borislav Deianov wrote:
> > >
> > > SCHED_RR threads don't work
> > > sched_yield doesn't work for RT threads
These two should work with the patch.

> > > sched_yield doesn't always yield for SCHED_OTHER threads
I don't think this is enforced by Unix98.

> > > wrong wake up order for RT threads
> > > sched_rr_get_interval returns constant bogus value
These two should be fixed as well.

> > > counter for SCHED_FIFO threads is never reset
Not yet handled. As I understand it, this is not a bug; it just introduces
additional overhead once the counter has expired.

> > > several wrong and misleading comments in the code
Some should be fixed. I hope I didn't get this wrong.

> [...]
> I believe this approach has a fundamental problem, though. The current
> scheduler makes prev default (if still running) before it considers
> all other processes in the main loop. With this patch prev is
> considered when it's encountered in the main loop. I had a patch that
> did the same thing but have since decided that this is wrong for
> SCHED_FIFO threads. Quote from sched_setscheduler(2):
>
> | A call to sched_setscheduler or sched_setparam will put the
> | SCHED_FIFO process identified by pid at the end of the list if it
> | was runnable. ... A SCHED_FIFO process runs until either it is
> | blocked by an I/O request, it is preempted by a higher priority
> | process, or it calls sched_yield.
>
> This means that a SCHED_FIFO process can be forced to the end of the
> list for its priority while still running. Even so, it should continue
> to run and not be preempted by another process with the same
> priority. With your patch it would be preempted at the next
> reschedule.
>
> I wish I had the relevant POSIX standard to look at for this, it does
> seem to be a fine point.

Ok. I had a look at the Unix98 specification; here's a short summary
of what I found out:
* Unix98 assumes that a running process is not on a run queue; only
  processes that are runnable but not actually running are queued. This
  implies (IMHO) that the scheduler must be called to select a new
  process after a currently running task was inserted at the start or
  end of a run queue.
* Calling sched_setscheduler or sched_setparam only moves the process to
  the end of the run queue if the priority or policy has actually been
  modified (this may be debatable, but that's how I read the spec). A
  currently running process that is inserted into the run queue is also
  preempted at this point.
* I couldn't find anything in the specification that disallows selecting
  the same SCHED_OTHER process again after a call to sched_yield. The
  benefit would be questionable anyway: e.g. a process that just calls
  sched_yield in an endless loop could be running at the same time, and
  nothing prevents the scheduler from selecting this process and then
  the original process again (see the small test program below).
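
As a quick userspace illustration of the calls discussed above (just a
test sketch, not part of the patch; it needs root to set a real time
policy):

#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param sp;

        sp.sched_priority = 10;
        if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
                perror("sched_setscheduler");
                return 1;
        }
        /* Unix98: sched_yield puts us at the end of the list for our
         * priority; a runnable SCHED_FIFO task of equal priority (if
         * any) should then run before we do. */
        sched_yield();
        puts("back after sched_yield");
        return 0;
}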

BTW: Unix98 is available online, e.g. at
        http://www.opengroup.org/onlinepubs/7908799/index.html

Ok. Here we go:
--- kernel/sched.c.orig Tue Jan 4 19:12:25 2000
+++ kernel/sched.c Sat Apr 29 12:20:03 2000
@@ -95,6 +95,16 @@
  */
  
 struct task_struct * task[NR_TASKS] = {&init_task, };
+/*
+ * The struct is necessary to prevent gcc from reordering the variables.
+ * The real time run queue is protected by runqueue_lock.
+ */
+struct {
+ struct task_struct * first, *last;
+} rt_queue;
+
+/* No need to worry about this! gcc/ld resolve this at link time */
+#define RT_QUEUE_HEAD ((struct task_struct *)(((unsigned long)(&(rt_queue.first)))-offsetof(struct task_struct, next_run)))
 
 /*
  * We align per-CPU scheduling data on cacheline boundaries,
@@ -357,24 +367,45 @@
         reschedule_idle_slow(p);
 }
 
+static inline void add_rt_queue_last (struct task_struct * p) {
+ p->prev_run = rt_queue.last;
+ p->next_run = RT_QUEUE_HEAD;
+ rt_queue.last->next_run = p;
+ rt_queue.last = p;
+}
+
+static inline void add_rt_queue_first (struct task_struct *p) {
+ p->next_run = rt_queue.first;
+ p->prev_run = RT_QUEUE_HEAD;
+ rt_queue.first->prev_run = p;
+ rt_queue.first = p;
+}
+
+
 /*
- * Careful!
- *
- * This has to add the process to the _beginning_ of the
- * run-queue, not the end. See the comment about "This is
- * subtle" in the scheduler proper..
+ * This function must be called with the runqueue lock held and interrupts
+ * disabled. Real time tasks are added to the end of the real time run queue,
+ * SCHED_OTHER tasks are added at the beginning of the normal run queue.
  */
 static inline void add_to_runqueue(struct task_struct * p)
 {
- struct task_struct *next = init_task.next_run;
-
- p->prev_run = &init_task;
- init_task.next_run = p;
- p->next_run = next;
- next->prev_run = p;
         nr_running++;
+ if ((p->policy & ~SCHED_YIELD) == SCHED_OTHER) {
+ struct task_struct *next = init_task.next_run;
+
+ p->prev_run = &init_task;
+ init_task.next_run = p;
+ p->next_run = next;
+ next->prev_run = p;
+ return;
+ }
+ /* RT process */
+ add_rt_queue_last (p);
 }
 
+/*
+ * Note that this works for both the normal and the real time run queue
+ */
 static inline void del_from_runqueue(struct task_struct * p)
 {
         struct task_struct *next = p->next_run;
@@ -395,13 +426,18 @@
         /* remove from list */
         next->prev_run = prev;
         prev->next_run = next;
- /* add back to list */
- p->next_run = &init_task;
- prev = init_task.prev_run;
- init_task.prev_run = p;
- p->prev_run = prev;
- prev->next_run = p;
-}
+ /* add back to correct list */
+ if ((p->policy & ~SCHED_YIELD) == SCHED_OTHER) {
+ p->next_run = &init_task;
+ prev = init_task.prev_run;
+ init_task.prev_run = p;
+ p->prev_run = prev;
+ prev->next_run = p;
+ return;
+ }
+ /* RT process */
+ add_rt_queue_last (p);
+}
 
 static inline void move_first_runqueue(struct task_struct * p)
 {
@@ -411,12 +447,17 @@
         /* remove from list */
         next->prev_run = prev;
         prev->next_run = next;
- /* add back to list */
- p->prev_run = &init_task;
- next = init_task.next_run;
- init_task.next_run = p;
- p->next_run = next;
- next->prev_run = p;
+ /* add back to correct list */
+ if ((p->policy & ~SCHED_YIELD) == SCHED_OTHER) {
+ p->prev_run = &init_task;
+ next = init_task.next_run;
+ init_task.next_run = p;
+ p->next_run = next;
+ next->prev_run = p;
+ return;
+ }
+ /* RT process */
+ add_rt_queue_first (p);
 }
 
 /*
@@ -714,11 +755,6 @@
 
         spin_lock_irq(&runqueue_lock);
 
- /* move an exhausted RR process to be last.. */
- if (prev->policy == SCHED_RR)
- goto move_rr_last;
-move_rr_back:
-
         switch (prev->state) {
                 case TASK_INTERRUPTIBLE:
                         if (signal_pending(prev)) {
@@ -731,6 +767,10 @@
         }
         prev->need_resched = 0;
 
+ /* Any runnable real time processes ? */
+ if (rt_queue.first != RT_QUEUE_HEAD)
+ goto select_rt_process;
+
 repeat_schedule:
 
         /*
@@ -745,21 +785,6 @@
                 goto still_running;
 still_running_back:
 
- /*
- * This is subtle.
- * Note how we can enable interrupts here, even
- * though interrupts can add processes to the run-
- * queue. This is because any new processes will
- * be added to the front of the queue, so "p" above
- * is a safe starting point.
- * run-queue deletion and re-ordering is protected by
- * the scheduler lock
- */
-/*
- * Note! there may appear new tasks on the run-queue during this, as
- * interrupts are enabled. However, they will be put on front of the
- * list, so our list starting at "p" is essentially fixed.
- */
         while (p != &init_task) {
                 if (can_schedule(p)) {
                         int weight = goodness(prev, p, this_cpu);
@@ -777,6 +802,7 @@
          * switching to the next task, save this fact in
          * sched_data.
          */
+found_next:
         sched_data->curr = next;
 #ifdef __SMP__
          next->has_cpu = 1;
@@ -828,6 +854,27 @@
         reacquire_kernel_lock(current);
         return;
 
+select_rt_process:
+ prev->policy &= ~SCHED_YIELD;
+ if (prev->policy == SCHED_RR)
+ goto move_rr_last;
+move_rr_back:
+ next = NULL;
+ c = 0;
+ p = rt_queue.first;
+ while (p != RT_QUEUE_HEAD) {
+ /* Sorry, no way around the second check */
+ if (can_schedule(p) || (p == prev)) {
+ if (p->priority > c)
+ c = p->priority, next = p;
+ }
+ p = p->next_run;
+ }
+ /* All runnable RT processes are already running */
+ if (!c)
+ goto repeat_schedule;
+ goto found_next;
+
 recalculate:
         {
                 struct task_struct *p;
@@ -856,7 +903,9 @@
 move_rr_last:
         if (!prev->counter) {
                 prev->counter = prev->priority;
- move_last_runqueue(prev);
+ /* We might have removed the task from the run queue */
+ if (prev->state == TASK_RUNNING)
+ move_last_runqueue(prev);
         }
         goto move_rr_back;
 
@@ -1765,12 +1814,17 @@
                 goto out_unlock;
 
         retval = 0;
- p->policy = policy;
- p->rt_priority = lp.sched_priority;
- if (p->next_run)
- move_first_runqueue(p);
-
- current->need_resched = 1;
+ /*
+ * UNIX98 requirement: Move the process to the end of the run queue
+ * iff the priority or policy actually changed.
+ */
+ if ((p->policy != policy) || (p->rt_priority != lp.sched_priority)) {
+ p->policy = policy;
+ p->rt_priority = lp.sched_priority;
+ if (p->next_run)
+ move_last_runqueue(p);
+ current->need_resched = 1;
+ }
 
 out_unlock:
         read_unlock(&tasklist_lock);
@@ -1892,9 +1946,22 @@
 asmlinkage int sys_sched_rr_get_interval(pid_t pid, struct timespec *interval)
 {
         struct timespec t;
+ struct task_struct * p;
+ int val;
 
+ val = current->priority;
+ if (pid && (pid != current->pid)) {
+ read_lock (&tasklist_lock);
+ p = find_task_by_pid (pid);
+ if (p)
+ val = p->priority;
+ read_unlock (&tasklist_lock);
+ if (!p)
+ return -ESRCH;
+ }
+
         t.tv_sec = 0;
- t.tv_nsec = 150000;
+ t.tv_nsec = (val+1) * 1000000000 / HZ;
         if (copy_to_user(interval, &t, sizeof(struct timespec)))
                 return -EFAULT;
         return 0;
@@ -2059,6 +2126,11 @@
 
         for(nr = 0; nr < PIDHASH_SZ; nr++)
                 pidhash[nr] = NULL;
+
+ if ((&rt_queue.first != &(RT_QUEUE_HEAD->next_run)) ||
+ (&rt_queue.last != &(RT_QUEUE_HEAD->prev_run)))
+ panic ("Wrong offsets of next_run/prev_run");
+ rt_queue.first = rt_queue.last = RT_QUEUE_HEAD;
 
         init_bh(TIMER_BH, timer_bh);
         init_bh(TQUEUE_BH, tqueue_bh);
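
To see the effect of the sys_sched_rr_get_interval change, a minimal
userspace check could look like this (again only a sketch, not part of
the patch; run it as a SCHED_RR task and the reported quantum now
depends on the priority instead of the old constant 150000 ns):

#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
        struct timespec ts;

        /* pid 0 means the calling process */
        if (sched_rr_get_interval(0, &ts) == -1) {
                perror("sched_rr_get_interval");
                return 1;
        }
        printf("RR quantum: %ld.%09ld seconds\n",
               (long) ts.tv_sec, ts.tv_nsec);
        return 0;
}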

    regards Christian

