Scheduler problem in 2.2.1[34..]

From: Andris Pavenis (andris@stargate.astr.lu.lv)
Date: Fri Jan 28 2000 - 04:33:44 EST


I was getting kernel oopses in schedule() several times per day on
average when XFree86 was running (and much more seldom otherwise).

This happened with kernels 2.2.12, 2.2.13, 2.2.14 and also the latest
prereleases of 2.2.15. (I have now built 2.2.15pre5, but the data below
are from 2.2.15pre4 with Rik's fix for mm/page_alloc.c. I'm also getting
many gfp messages, but they don't resolve to anything useful.) I posted
data about the oopses earlier this month. It also seems that the compiler
used to build the kernel doesn't matter: these data are for a kernel
built with gcc-2.7.2.3, but I have also tried egcs-1.1.2 and gcc-2.95.2
and didn't see any difference.

After that I patched del_from_runqueue() to verify its argument (similarly
to what was done in 2.0.3X kernels). The related output in the log file is
below:

Jan 27 21:21:05 hal kernel: del_from_runqueue(C03EE000) : Task not in run queue
Jan 27 21:21:05 hal kernel: prev_run=00000000 next_run=00000000 state=1 flags=0 nr_running=2
Jan 27 21:21:05 hal kernel: prev=c0db4000 next=c03b2000 pid=205
Jan 27 21:21:05 hal kernel: current=c03ee000 current->pid=205 current->state=1 current->flags=0
Jan 27 22:21:05 hal kernel: del_from_runqueue(C03EE000) : Task not in run queue
Jan 27 22:21:05 hal kernel: prev_run=00000000 next_run=00000000 state=1 flags=0 nr_running=2
Jan 27 22:21:05 hal kernel: prev=c0db4000 next=c03b2000 pid=205
Jan 27 22:21:05 hal kernel: current=c03ee000 current->pid=205 current->state=1 current->flags=0

It seems that schedule() tries to remove the current task from the run
queue twice due to some problem (I don't know why). Without the included
patch I got an oops. Of course this patch doesn't fix the real problem;
it only avoids the crashes. The affected process is practically always
maudio (from KDE-1.1.2), but sometimes also kwm (also KDE-1.1.2). At least
maudio seems to be in a usable state after this happens.

At least after patching kernel/sched.c I got no more oopses.

  PID TTY STAT TIME COMMAND
  205 ? S 0:02 maudio -media 1
  
The patch for kernel/sched.c that I used is included below.

What should I do next to debug this problem? Is it possible to get a
kernel stack trace without an oops?

Andris

PS. I'm not subscribed to the kernel mailing list, so please also send
    answers to me.

======================================================================
*** linux-2.2.15pre4/kernel/sched.c~1 Tue Jan 4 20:12:25 2000
--- linux-2.2.15pre4/kernel/sched.c Wed Jan 26 10:04:57 2000
***************
*** 380,385 ****
--- 380,397 ----
          struct task_struct *next = p->next_run;
          struct task_struct *prev = p->prev_run;
  
+         if (!prev || !next)
+         {
+                 printk ("del_from_runqueue(%08X) : Task not in run queue\n",p);
+                 printk ("prev_run=%p next_run=%p state=%X flags=%X nr_running=%d\n",
+                         prev, next, p->state, p->flags, nr_running);
+                 printk ("prev=%p next=%p pid=%d\n",
+                         p->prev_task, p->next_task, (int) p->pid);
+                 printk ("current=%p current->pid=%d current->state=%X current->flags=%X\n",
+                         current,current->pid,current->state,current->flags);
+                 return;
+         }
+
          nr_running--;
          next->prev_run = prev;
          prev->next_run = next;
