Consistent deadlock+crash in 2.6.5 when writing large file to CFSd (over NFS)

From: Elladan
Date: Wed May 19 2004 - 02:54:40 EST


I'm getting consistent crashes in 2.6.5 when I write a large file to cfs
(crypto file system). This is a user daemon that implements an NFS
export of encrypted files on disk.

Basically, what appears to happen is that it's deadlocking something in
the IO system. I can't really tell what, because the entire system
seizes up soon afterward.

The behavior is as follows:

1. cp largefile /crypt/ell/ (largefile is on disk b, the crypto area is
on disk a)
2. Disk activity is high for a brief period of time (a few seconds)
3. Disk activity falls to nothing. A quick inspection with ps alx
indicates that more and more processes are stacking up in D state,
blocked on "schedu". These may be processes accessing any disk.
4. Periodically, the system may unfreeze briefly, causing a huge burst
of IO. But it'll stop again within a few seconds.
5. Eventually, the system will deadlock completely and have to be power
cycled.
6. However, until it deadlocks, some commands are able to touch the
disks, and I'm able to look at some directories that aren't already
in cache etc.

Here's some random data:

Linux tandu 2.6.5 #12 Sun Apr 25 13:01:19 PDT 2004 i686 GNU/Linux

Module Size Used by
usblp 11200 0
pl2303 14980 0
usbserial 27056 1 pl2303
joydev 8256 0
ehci_hcd 24132 0
ohci_hcd 16900 0
nvidia_agp 5788 1
tuner 17100 0
tvaudio 20428 0
bttv 141804 0
video_buf 16580 1 bttv
i2c_algo_bit 8904 1 bttv
v4l2_common 4928 1 bttv
btcx_risc 3848 1 bttv
videodev 7296 1 bttv
snd_emu10k1 92292 0
snd_rawmidi 20448 1 snd_emu10k1
snd_pcm_oss 49124 0
snd_mixer_oss 17408 1 snd_pcm_oss
snd_pcm 85348 2 snd_emu10k1,snd_pcm_oss
snd_timer 21636 1 snd_pcm
snd_seq_device 6536 2 snd_emu10k1,snd_rawmidi
snd_ac97_codec 61444 1 snd_emu10k1
snd_page_alloc 8964 2 snd_emu10k1,snd_pcm
snd_util_mem 3264 1 snd_emu10k1
snd_hwdep 7264 1 snd_emu10k1
snd 45732 10 snd_emu10k1,snd_rawmidi,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_timer,snd_seq_device,snd_ac97_codec,snd_util_mem,snd_hwdep
soundcore 7392 2 bttv,snd
i2c_sensor 2368 0
i2c_isa 1664 0
i2c_nforce2 5120 0
i2c_core 18756 7 tuner,tvaudio,bttv,i2c_algo_bit,i2c_sensor,i2c_isa,i2c_nforce2
agpgart 28072 1 nvidia_agp
sg 29472 0


F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
4 0 1 0 16 0 1528 528 schedu S ? 0:04 init [2]
1 0 2 1 34 19 0 0 ksofti SN ? 0:00 [ksoftirqd/0]
5 0 3 1 5 -10 0 0 worker S< ? 0:00 [events/0]
1 0 4 3 5 -10 0 0 worker S< ? 0:00 [kblockd/0]
1 0 6 3 20 0 0 0 pdflus S ? 0:00 [pdflush]
1 0 5 1 15 0 0 0 hub_th S ? 0:00 [khubd]
1 0 7 3 15 0 0 0 pdflus S ? 0:00 [pdflush]
1 0 9 3 10 -10 0 0 worker S< ? 0:00 [aio/0]
1 0 8 1 15 0 0 0 kswapd S ? 0:00 [kswapd0]
1 0 10 1 19 0 0 0 down_i S ? 0:00 [khpsbpkt]
1 0 11 1 19 0 0 0 serio_ S ? 0:00 [kseriod]
1 0 12 1 15 0 0 0 kjourn S ? 0:00 [kjournald]
1 0 172 1 16 0 0 0 kjourn S ? 0:00 [kjournald]
1 0 173 1 15 0 0 0 kjourn S ? 0:00 [kjournald]
1 0 174 3 5 -10 0 0 worker S< ? 0:00 [reiserfs/0]
1 0 175 1 18 0 0 0 kjourn S ? 0:00 [kjournald]
1 0 176 1 16 0 0 0 kjourn S ? 0:00 [kjournald]
1 0 177 1 15 0 0 0 kjourn S ? 0:00 [kjournald]
5 1 1259 1 16 0 1640 444 schedu Ss ? 0:00 /sbin/portmap
5 0 1357 1 15 0 2284 820 schedu Ds ? 0:00 /sbin/syslogd
5 0 1360 1 16 0 2488 1460 syslog Ss ? 0:00 /sbin/klogd
5 0 1396 1 16 0 2900 1732 schedu Ss ? 0:00 /usr/sbin/named
5 0 1413 1 16 0 2616 1040 schedu S ? 0:00 ptal-mlcd mlc:usb:photosmart_7200_series -devidmatch MDL:photosmart 7200 series; -devidmatch SN:CN39G3B8CPI5; -device /dev/usb/lp[0-9]* /dev/usblp[0-9]*
5 0 1416 1 17 0 3216 856 schedu S ? 0:00 ptal-printd mlc:usb:photosmart_7200_series -morepipes 9 -like /etc/ptal/ptal-printd-like
5 0 1419 1 19 0 3220 844 schedu S ? 0:00 ptal-photod mlc:usb:photosmart_7200_series -maxaltports 26
5 0 1425 1 18 0 2404 924 schedu Ss ? 0:00 /sbin/rpc.statd
5 0 1459 1 16 0 2416 948 schedu S ? 0:00 /usr/sbin/ez-ipupdate -d -c /etc/ez-ipupdate/tandu.conf -F /var/run/ez-ipupdate/tandu.pid
5 0 1463 1 18 0 2768 1160 schedu Ss ? 0:00 /usr/sbin/famd -T 0
5 0 1473 1 15 0 2264 716 schedu Ss ? 0:00 /usr/sbin/inetd
1 1 1484 1 17 0 4068 1296 schedu Ss ? 0:00 lpd Waiting
4 0 1499 1 24 0 2540 1236 wait4 S ? 0:00 /bin/sh /usr/bin/mysqld_safe
4 102 1534 1499 15 0 67340 7840 schedu S ? 0:00 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/var/run/mysqld/mysqld.sock
1 108 1550 1 16 0 104248 14612 schedu Ss ? 0:00 /usr/bin/mythbackend --daemon --logfile /var/log/mythtv/mythbackend.log --pidfile /var/run/mythtv/mythbackend.pid
0 31 1595 1 15 0 17112 2180 schedu S ? 0:00 /usr/lib/postgresql/bin/postmaster -D /var/lib/postgres/data
1 31 1599 1595 15 0 7912 1972 schedu S ? 0:00 postgres: stats buffer process
1 31 1600 1599 15 0 7052 2036 schedu S ? 0:00 postgres: stats collector process
0 108 1607 1606 34 19 59776 32492 schedu DN ? 2:17 mythtranscode -c 1007 -s 2004-05-15T21:00:00 -p autodetect -d -l
1 31 1621 1 15 0 4556 1592 schedu Ss ? 0:00 /usr/lib/postgresql/bin/pg_autovacuum -D -p 5432 -L /var/log/postgresql/autovacuum_log
5 0 1631 1 18 0 1548 476 schedu Ss ? 0:00 /usr/sbin/smartd
5 0 1685 1 18 0

Note that mythtranscode appears to have seized up in this dump. Also
note that the dump stops right where it appears to stop. The computer
crashed.

No, the kernel isn't tainted.

Any ideas?

-J
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/