Hang in work_resched

From: Guillaume Chazarain
Date: Tue Jan 29 2008 - 17:30:37 EST


Hi,

In recent git (0ba6c33bcddc64a54b5f1c25a696c4767dc76292) I experience
a hard to reproduce problem where a single process hangs in
work_resched. The method I use so far to reproduce it is to launch
mplayer in a gnome-terminal (Fedora 8,i386) and wait for mplayer to
freeze when it cannot write into gnome-terminal, the latter being
frozen in work_resched.

sysrq-t gives valid backtraces for all tasks, but only a single line
for gnome-terminal:

=======================
gnome-termina S 00000027 0 2201 1
f6711fb0 00200082 cb330d62 00000027 f664105c 00000b1e 00000000 cb331880
00000027 f660d780 009e3840 080ab7d8 080ab298 f6711000 c0103e7e 009e3840
000e0002 00000002 080ab7d8 080ab298 bfb41be8 080ab7d8 0000007b c010007b
Call Trace:
[<c0103e7e>] work_resched+0x5/0x16
=======================

This corresponds to the cli instruction:
c0103e7e: fa cli

When hung, stracing or killing gnome-terminal wakes it up, but not
kill -CONT without a previous kill -STOP.

While being straced gnome-terminal seems to never hang, and a strace
used to unfreeze it starts with:

23:06:53.182000 gettimeofday({1201644413, 182184}, NULL) = 0 <0.000401>
23:06:53.182644 poll([{fd=3, events=POLLIN, revents=POLLIN}, {fd=4,
events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, {fd=11,
events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}, {fd=14,
events=POLLIN|POLLPRI}, {fd=15, events=POLLIN|POLLPRI}, {fd=16,
events=POLLIN|POLLPRI}, {fd=18, events=POLLIN|POLLPRI}, {fd=17,
events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}], 11, 0) = 1
<0.000028>
23:06:53.182840 read(3,
"\34\326\324\20\\\0\0\0W\1\0\0&~\226\307\0\1\0\0\2\0\0\0\0\0\0\0\\\0\0\0"...,
4096) = 4096 <0.000166>
23:06:53.183504 read(3,
"\6\0\324\20K\206\226\307\\\0\0\0008\v`\2\0\0\0\0\311\2\232\1\n\0R\1\0\0\1\0"...,
4096) = 3616 <0.000168>
23:06:53.184073 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000022>
23:06:53.184188 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.184280 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000020>
23:06:53.184370 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.184460 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000020>
23:06:53.184570 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000022>
23:06:53.184661 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.184751 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.184840 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000651>
23:06:53.185610 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.185707 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.185797 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.185888 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.185977 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.186067 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000022>
23:06:53.186158 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.186247 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.186337 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.186426 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.186516 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.186627 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000020>
23:06:53.186717 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.005377>
23:06:53.192208 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000022>
23:06:53.192310 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
23:06:53.192401 read(3, 0x80ab868, 4096) = -1 EAGAIN (Resource
temporarily unavailable) <0.000021>
...
then back to normal


A normal strace of gnome-terminal never shows more than one
consecutive read()->EAGAIN, unlike here, after a hang in work_resched.

I think file descriptor 3 is the local connection to the X server, and
the hang happens in a poll() call.

I understand this is a bit vague, but maybe I'll get some me-too.

--
Guillaume

Attachment: dmesg
Description: Binary data

Attachment: config
Description: Binary data