Race condition with SMP+modules? (Was: Problems with Adaptec 2940 as a module)

Matthias Sattler (sattler@unix-ag.uni-kl.de)
Wed, 30 Oct 1996 11:26:59 +0100 (MET)


Hiho

Let me try to mail my bug report for the third time, as mail delivery
here at uni-kl seems to have been very unreliable over the last few days.

I have problems loading some device drivers as modules, while they work fine
if compiled into the kernel.
Let me start with my setup so that you can better understand the problem.

- I use kernel 2.0.23 without 'Alan's latest and greatest ;)' patch.
  (Alan: This patch isn't working for me. I suspect that the missing
  physical CPU #1 is still causing this...)
- Gigabyte 586-DX dual Pentium motherboard with Intel HX chipset and
  Adaptec 2940UW on board, used with two P100s.
- I boot the kernel into an initial ramdisk, then linuxrc starts kerneld,
  which should load the Adaptec module to mount the real / filesystem.
  This worked fine for me with the BSD NCR driver on a 486 for a long time.

My problem is that I get an oops when kerneld tries to load the Adaptec
module. I suspect a race condition somewhere, because the oopses differ
from one attempt to the next.
Kerneld itself isn't the problem, because an insmod by hand
(linuxrc -> /bin/sh) fails in the same way.

Related problems may be:
- mcdx works if compiled into the kernel but not as a module.
- I get horrible flickering (like a badly tuned satellite receiver) when
  using some 16BPP screen modes on large and rapidly changing pictures
  (ico -faces is a good example). None of the errors remain in the picture
  as soon as it is static again (S3-864 PCI graphics card with 2MB DRAM,
  which also worked fine in my old 486).

Here is the data from two oopses that appear frequently.
I start at the end of the call trace and end with the EIP.

(gdb) l *0x109134
0x109134 is in start_kernel (init/main.c:773).
768 #ifdef __SMP__
769 static int first_cpu=1;
770
771 if(!first_cpu)
772 start_secondary();
773 first_cpu=0;
774
775 #endif
776 /*
777 * Interrupts are still disabled. Do necessary setups, then

(gdb) l *0x1099cf
0x1099cf is in cpu_idle (process.c:173).
168 continue;
169 }
170 smp_process_available--;
171 clear_bit(31,&smp_process_available);
172 sti();
173 idle();
174 }
175 }
176
177 #endif

(gdb) l *0x10ade2
No source file for address 0x10ade2.
By hand: 0010ad10 T system_call

(gdb) l*0x1098f8
0x1098f8 is in sys_idle (process.c:139).
134 smp_spins_sys_idle[smp_processor_id()]+=
135 smp_spins_syscall_cur[smp_processor_id()];
136 #endif
137 current->counter= -100;
138 schedule();
139 return 0;
140 }
141
142 /*
143 * This is being executed in task 0 'user space'.

(gdb) l *0x10ad9d
No source file for address 0x10ad9d.
By hand: 0010ad10 T system_call

EIP:
(gdb) l *0x113fef
0x113fef is in schedule (sched.c:241).
236 {
237 int weight;
238
239 #ifdef __SMP__
240 /* We are not permitted to run a task someone else is running */
241 if (p->processor != NO_PROC_ID)
242 return -1000;
243 #ifdef PAST_2_0
244 /* This process is locked to a processor group */
245 if (p->processor_mask && !(p->processor_mask & (1<<this_cpu))

Here comes the second one. I only have the call trace here (and not the
EIP), because the EIP scrolled off the screen and the computer was dead
afterwards.

(gdb) l *0x109554
0x109554 is in init (init/main.c:906).
901 mount_initrd = 0;
902 }
903 #endif
904
905 static int init(void * unused)
906 {
907 int pid,i;
908 #ifdef CONFIG_BLK_DEV_INITRD
909 int real_root_mountflags;
910 #endif

(gdb) l *0x109303
0x109303 is in start_kernel (/usr/src/linux/include/asm/unistd.h:303).
298 */
299 static inline pid_t kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
300 {
301 long retval;
302
303 __asm__ __volatile__(
304 "movl %%esp,%%esi\n\t"
305 "int $0x80\n\t" /* Linux/i386 system call */
306 "cmpl %%esp,%%esi\n\t" /* child or parent? */
307 "je 1f\n\t" /* parent - jump */

(gdb) l *0x109693
0x109693 is in init (init/main.c:958).
953
954 pid = kernel_thread(do_linuxrc, "/linuxrc", SIGCHLD);
955 if (pid>0)
956 while (pid != wait(&i));
957 if (real_root_dev != MKDEV(RAMDISK_MAJOR, 0)) {
958 error = change_root(real_root_dev,"/initrd");
959 if (error)
960 printk(KERN_ERR "Change root to /initrd: "
961 "error %d\n",error);
962 }

(gdb) l *0x12d57c
0x12d57c is in change_root (super.c:1052).
1047 printk(KERN_CRIT "New root is busy. Staying in initrd.\n");
1048 return -EBUSY;
1049 }
1050 ROOT_DEV = new_root_dev;
1051 do_mount_root();
1052 old_fs = get_fs();
1053 set_fs(get_ds());
1054 error = namei(put_old,&inode);
1055 if (error) inode = NULL;
1056 set_fs(old_fs);

(gdb) l *0x12d37e
0x12d37e is in do_mount_root (super.c:986).
981 filp.f_inode = &d_inode;
982 if ( root_mountflags & MS_RDONLY)
983 filp.f_mode = 1; /* read only */
984 else
985 filp.f_mode = 3; /* read write */
986 retval = blkdev_open(&d_inode, &filp);
987 if (retval == -EROFS) {
988 root_mountflags |= MS_RDONLY;
989 filp.f_mode = 1;
990 retval = blkdev_open(&d_inode, &filp);

(gdb) l *0x1282c2
0x1282c2 is in blkdev_open (devices.c:230).
225 * Called every time a block special file is opened
226 */
227 int blkdev_open(struct inode * inode, struct file * filp)
228 {
229 int ret = -ENODEV;
230 filp->f_op = get_blkfops(MAJOR(inode->i_rdev));
231 if (filp->f_op != NULL){
232 ret = 0;
233 if (filp->f_op->open != NULL)
234 ret = filp->f_op->open(inode,filp);

(gdb) l *0x12800c
0x12800c is in get_blkfops (devices.c:113).
108 Return the function table of a device.
109 Load the driver if needed.
110 */
111 struct file_operations * get_blkfops(unsigned int major)
112 {
113 return get_fops (major,0,MAX_BLKDEV,"block-major-%d",blkdevs);
114 }
115
116 struct file_operations * get_chrfops(unsigned int major, unsigned int minor)
117 {

(gdb) l *0x120038
0x120038 is in truncate_inode_pages (filemap.c:107).
102 if (PageLocked(page)) {
103 wait_on_page(page);
104 goto repeat;
105 }
106 inode->i_nrpages--;
107 if ((*p = page->next) != NULL)
108 (*p)->prev = page->prev;
109 page->dirty = 0;
110 page->next = NULL;
111 page->prev = NULL;

(gdb) l *0x127fe1
0x127fe1 is in get_fops (/usr/src/linux/include/linux/kerneld.h:62).
57 */
58 static inline int request_module(const char *name)
59 {
60 return kerneld_send(KERNELD_REQUEST_MODULE,
61 0 | KERNELD_WAIT,
62 strlen(name), name, NULL);
63 }
64
65 /*
66 * Request the removal of a module, maybe don't wait for it.

(gdb) l *0x139f6f
0x139f6f is in kerneld_send (msg.c:761).
756 ret_size &= ~KERNELD_WAIT;
757 kmsp.text = (char *)ret_val;
758 status = real_msgrcv(kerneld_msqid, (struct msgbuf *)&kmsp,
759 KDHDR + ((ret_val)?ret_size:0),
760 kmsp.id, msgflg);
761 if (status > 0) /* a valid answer contains at least a long */
762 status = kmsp.id;
763 }
764
765 #endif /* CONFIG_KERNELD */

(gdb) l *0x138fb8
0x138fb8 is in kd_timeout (msg.c:208).
203 #define KERNELD_TIMEOUT 1 * (HZ)
204 #define DROP_TIMER del_timer(&kd_timer)
205 /*#define DROP_TIMER if ((msgflg & IPC_KERNELD) && kd_timer.next && kd_timer.prev) del_timer(&kd_timer)*/
206
207 static void kd_timeout(unsigned long msgid)
208 {
209 struct msqid_ds *msq;
210 struct msg *tmsg;
211 unsigned long flags;
212

(gdb) l *0x13941c
0x13941c is in real_msgrcv (msg.c:387).
382 if (current->signal & ~current->blocked) {
383 DROP_TIMER;
384 return -EINTR;
385 }
386 interruptible_sleep_on (&msq->rwait);
387 }
388 } /* end while */
389 DROP_TIMER;
390 return -1;
391 }

(gdb) l *0x11467e
0x11467e is in interruptible_sleep_on (sched.c:584).
579 save_flags(flags);
580 cli();
581 __add_wait_queue(p, &wait);
582 sti();
583 schedule();
584 cli();
585 __remove_wait_queue(p, &wait);
586 restore_flags(flags);
587 }
588

(gdb) l *0x2011000
No source file for address 0x2011000.
Nothing found by hand... In a module?
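
The "by hand" lookups above (finding the symbol whose start address lies at or below the address from the oops) can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the names `struct ksym`, `ksym_lookup`, `ksym_print` and `sample_map` are made up for this sketch, and of the three table entries only `0010ad10 T system_call` is taken from this report. The first and last entries are hypothetical placeholders that give the table a lower and an upper bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One line of a System.map-style table: start address, type letter, name. */
struct ksym {
    unsigned long addr;
    char type;
    const char *name;
};

/*
 * Must be sorted by address. Only the system_call entry comes from the
 * report; the first and last entries are HYPOTHETICAL placeholders.
 * The last entry is treated as an end marker, not as a real symbol.
 */
static const struct ksym sample_map[] = {
    { 0x00100000UL, 'T', "hypothetical_text_start" },
    { 0x0010ad10UL, 'T', "system_call" },
    { 0x001c0000UL, 'A', "hypothetical_end" },
};

/*
 * Return the last symbol whose start address is <= addr.
 * Return NULL if addr lies below the first entry or at/after the
 * end marker, i.e. if the table does not cover the address at all.
 */
const struct ksym *ksym_lookup(const struct ksym *map, size_t n,
                               unsigned long addr)
{
    size_t lo = 0, hi = n - 1;

    assert(map != NULL && n >= 2);
    if (addr < map[0].addr || addr >= map[n - 1].addr)
        return NULL;

    /* Binary search; invariant: map[lo].addr <= addr < map[hi].addr. */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].addr <= addr)
            lo = mid;
        else
            hi = mid;
    }
    return &map[lo];
}

/* Print an address as symbol+offset, or note that the table misses it. */
void ksym_print(const struct ksym *map, size_t n, unsigned long addr)
{
    const struct ksym *s = ksym_lookup(map, n, addr);

    if (s != NULL)
        printf("0x%lx = %s+0x%lx\n", addr, s->name, addr - s->addr);
    else
        printf("0x%lx: not covered by the table\n", addr);
}
```

With this table, `ksym_print(sample_map, 3, 0x10ade2UL)` resolves the address to `system_call+0xd2`, while `0x2011000` falls outside the table. That is consistent with the guess that it lies in a module, but does not prove it.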

Matthias

m_sattle@informatik.uni-kl.de
FAX +49 (0)6333-65079

--> Don't take life too seriously -- you'll never get out of it alive. <--