Bug: App causes 2.0.34 kernel infinite loop in ext2

Don Bennett (dpb@infoseek.com)
Wed, 25 Nov 1998 18:24:02 -0800


Synopsis:

A heavily multithreaded application with lots of disk i/o causes a kernel
infinite loop in the ext2 code.

The application may run anywhere from an hour to a few days
before the problem occurs.

Base release: RedHat 5.1
Kernel version: 2.0.34, 2.0.35
libc version: 2.0.7

A thread will unexpectedly start to use all available CPU cycles.

Gdb is sometimes unable to attach to this thread.

When the thread can be attached, there is no apparent reason for the
thread to be using all of the CPU time. Stepping at the machine
instruction level seems to show that the thread is not actually
executing - it never makes it to the next instruction.

If you run 'top', it appears that all of the run time is
in 'system':

5:25pm up 1 day, 6:35, 3 users, load average: 1.00, 1.00, 1.43
76 processes: 72 sleeping, 4 running, 0 zombie, 0 stopped
CPU states: 0.3% user, 99.4% system, 0.0% nice, 0.2% idle
Mem: 127880K av, 117776K used, 10104K free, 15648K shrd, 19272K buff
Swap: 130748K av, 47044K used, 83704K free 41804K cached

PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
29244 dpb 20 0 99660 63M 3212 R 0 96.9 50.4 788:53 pyseekd
4158 root 6 0 828 828 628 R 0 2.9 0.6 0:00 top
1 root 0 0 148 100 80 S 0 0.0 0.0 0:07 init
2 root 0 0 0 0 0 SW 0 0.0 0.0 0:22 kflushd
3 root -12 -12 0 0 0 SW< 0 0.0 0.0 0:26 kswapd
4 root 0 0 0 0 0 SW 0 0.0 0.0 0:00 nfsiod

I configured a pair of systems for kernel debugging.
I compiled the debug kernel with -g -O1, and removed the
'-fomit-frame-pointer' option.

If I interrupt the kernel, a typical backtrace looks like the following:

Program received signal SIGTRAP, Trace/breakpoint trap.
breakpoint () at i386-stub.c:750
(gdb) where
#0 breakpoint () at i386-stub.c:750
#1 0x179441 in gdb_interrupt (irq=0x4, dev_id=0x0, regs=0x0)
at serialstub.c:131
#2 0x10cb9e in do_fast_IRQ (irq=0x4) at irq.c:389
#3 0x10bd67 in fast_IRQ4_interrupt () at irq.c:89
#4 0x15c777 in ext2_truncate (inode=0x3642c00) at truncate.c:337
#5 0x1578f2 in ext2_put_inode (inode=0x3642c00) at inode.c:43 (fs/ext2/inode.c)
#6 0x12475f in iput (inode=0x3642c00) at inode.c:469 (fs/inode.c)
#7 0x159de6 in ext2_rmdir (dir=0x7f67c00, name=0x2d4301d "020014", len=0x6)
at namei.c:694
#8 0x12c95b in do_rmdir (name=0x2d43000 "/users/ultraseek/data/ucb/db/020014")
at namei.c:650
#9 0x12c9ab in sys_rmdir (
pathname=0x45d6eeb4 <Address 0x45d6eeb4 out of bounds>) at namei.c:663
#10 0x10a6c1 in system_call ()
Cannot access memory at address 0xbbbff664.
(gdb)

I have also been able to catch it at lines 331 and 343 in truncate.c.

If I do a 'cd /users/ultraseek/data/ucb/db; ls',
I see that the directory 020014 no longer appears.
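That is consistent with the backtrace: rmdir has already removed the
directory entry and dropped the last link, so iput() on the final
reference calls ext2_put_inode(), which truncates and frees the
on-disk inode. From memory (please check against an actual 2.0.x
tree), fs/ext2/inode.c reads roughly like this:

    /* Paraphrased sketch of 2.0.x fs/ext2/inode.c, from memory --
     * not an exact copy of the source. */
    void ext2_put_inode (struct inode * inode)
    {
            ext2_discard_prealloc (inode);
            if (inode->i_nlink || inode->i_ino == EXT2_ACL_IDX_INO ||
                inode->i_ino == EXT2_ACL_DATA_INO)
                    return;
            inode->u.ext2_i.i_dtime = CURRENT_TIME;
            inode->i_dirt = 1;
            ext2_truncate (inode);      /* <-- where we spin */
            ext2_free_inode (inode);
    }

So the inode dump below (i_nlink == 0, i_dtime set, i_size == 0) shows
exactly the state this path produces, stuck partway through the truncate.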

Here's what the inode looks like:

(gdb) print *(struct inode *)0x3642c00
$8 = {
i_dev = 0x803,
i_ino = 0x2a829,
i_mode = 0x41ed,
i_nlink = 0x0,
i_uid = 0x1ef,
i_gid = 0x19,
i_rdev = 0x0,
i_size = 0x0,
i_atime = 0x365b5850,
i_mtime = 0x365b5854,
i_ctime = 0x365b5854,
i_blksize = 0x1000,
i_blocks = 0x0,
i_version = 0x1997a1,
i_nrpages = 0x0,
i_sem = {
count = 0x1,
waking = 0x0,
lock = 0x0,
wait = 0x0
},
i_op = 0x1ca620,
i_sb = 0x1ed818,
i_wait = 0x3642c48,
i_flock = 0x0,
i_mmap = 0x0,
i_pages = 0x0,
i_dquot = {0x0, 0x0},
i_next = 0x3642400,
i_prev = 0x342a500,
i_hash_next = 0x2a0cf00,
i_hash_prev = 0x1da6a00,
i_bound_to = 0x0,
i_bound_by = 0x0,
i_mount = 0x0,
i_count = 0x1,
i_flags = 0x0,
i_writecount = 0x0,
i_lock = 0x0,
i_dirt = 0x0,
i_pipe = 0x0,
i_sock = 0x0,
i_seek = 0x0,
i_update = 0x0,
i_condemned = 0x0,
u = {
ext2_i = {
i_data = {0x0, 0x4, 0x0 <repeats 13 times>},
i_flags = 0x0,
i_faddr = 0x0,
i_frag_no = 0x0,
i_frag_size = 0x0,
i_osync = 0x0,
i_file_acl = 0x0,
i_dir_acl = 0x0,
i_dtime = 0x365b5854,
i_version = 0x1,
i_block_group = 0x55,
i_next_alloc_block = 0x8,
i_next_alloc_goal = 0x8,
i_prealloc_block = 0x0,
i_prealloc_count = 0x0,
i_new_inode = 0x0
},
}
}
(gdb)

If I set a breakpoint at truncate.c:331 and step into the trunc_direct()
function, retry is set to one at line 85, because bh->b_count == 2:

(gdb) print bh
$21 = (struct buffer_head *) 0x7db8a18

(gdb) print *bh
$20 = {
b_blocknr = 0x4,
b_dev = 0x803,
b_rdev = 0x803,
b_rsector = 0x8,
b_next = 0x3af6298,
b_this_page = 0x7db8998,
b_state = 0x9,
b_next_free = 0x67f1918,
b_count = 0x2,
b_size = 0x400,
b_data = 0x7db3800 "\b",
b_list = 0x0,
b_flushtime = 0x0,
b_lru_time = 0xaa64ac,
b_wait = 0x7db8a48,
b_prev = 0x15bee98,
b_prev_free = 0x1886018,
b_reqnext = 0x0
}

That's all the diagnostic information I can see that might be useful.

If you have any ideas on what else I can do to track down this
problem, let me know.

If someone would like to run the application to try to
reproduce the problem, let me know.

Thanks,

Don Bennett
dpb@infoseek.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/