pseudo terminals just hang sometimes (bugreport)

Miquel van Smoorenburg (miquels@cistron.nl)
23 Jul 1996 12:36:09 +0200


From time to time an rlogin or telnet session to one of our Linux machines
fails. The rlogin just "hangs" without giving any output. The user _is_
logged in, though:

 11:15am  up  2:49,  6 users,  load average: 0.05, 0.10, 0.08
User     tty       login@  idle   JCPU   PCPU  what
dth      ttyp0     8:27am    14                -zsh
miquels  ttyp2    11:03am    11                -
miquels  ttyp5    11:04am                      -

The session on ttyp2 hangs, the rest works OK.

Running fuser shows that more than one process has ttyp2 open:

[picard:root](/proc/780/fd)> fuser -v /dev/ttyp2

                     USER       PID ACCESS COMMAND
/dev/ttyp2           maus       780 f....  eggdrop
                     root      3121 f....  login
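
For reference, the kind of lookup fuser does here can be approximated by
walking /proc. The following is only a minimal Python sketch, not how fuser
is actually implemented; the function name pids_with_open is made up for
illustration, and it assumes a kernel whose /proc/<pid>/fd entries are
symlinks that resolve to path names (older kernels may report them
differently).

```python
import os

def pids_with_open(path):
    """Return the sorted PIDs that hold `path` open, judged by /proc/<pid>/fd."""
    target = os.path.realpath(path)
    holders = set()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join("/proc", entry, "fd")
        try:
            names = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to look
        for name in names:
            try:
                if os.readlink(os.path.join(fd_dir, name)) == target:
                    holders.add(int(entry))
                    break
            except OSError:
                continue  # fd was closed while we were scanning
    return sorted(holders)
```

Run as root, pids_with_open("/dev/ttyp2") would be expected to report the
same PIDs as the fuser output above; without root it only sees processes
the caller is allowed to inspect.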

The first program, "eggdrop", was put in the background by one of our users,
who then logged out. Somehow this causes a second session on the same tty
to hang. The login hangs in a write():

[picard:root](/proc/780/fd)> strace -fp 3121
write(1, "Last login: Tue Jul 23 01:40:46 "..., 54

"ps" shows that the write hangs in "write_chan", of course.

The rlogind that is connected to the other side of the pty is doing a select:

[picard:root](/proc/3120/fd)> fuser -v /dev/ptyp2

                     USER       PID ACCESS COMMAND
/dev/ptyp2           root      3120 f....  in.rlogind

[picard:root](/proc/3120/fd)> strace -fp 3120
oldselect(4, [0 3], NULL, [3], NULL <unfinished ...>

Checking /proc/3120/fd reveals that fd #3 is indeed connected to the pty (#0
is the network socket). The process on the slave side of the pty also
has its fds connected to the right pty.

Somehow, the select() on the master pty doesn't see that something's
being written on the slave pty.
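
What should normally happen can be sketched as follows: data written to the
slave side makes the master side readable, so a select() on the master
returns. This is a minimal Python sketch under assumptions (a system where
os.openpty() works; the function name and the "hello" payload are invented
for illustration). It shows the expected behaviour only and does not
reproduce the hang.

```python
import os
import select

def master_sees_slave_write(payload=b"hello\n", timeout=2.0):
    """Write to the slave of a fresh pty; report whether select() on the master sees it."""
    master_fd, slave_fd = os.openpty()
    try:
        # Plays the role of login writing its banner to the tty.
        os.write(slave_fd, payload)
        # Plays the role of rlogind waiting on the master side.
        readable, _, _ = select.select([master_fd], [], [], timeout)
        if master_fd not in readable:
            return False, b""
        # The tty layer may rewrite line endings, so callers should not
        # expect the bytes read back to match the payload exactly.
        return True, os.read(master_fd, 1024)
    finally:
        os.close(slave_fd)
        os.close(master_fd)
```

On a healthy pty pair the first element of the result should be true well
within the timeout. In the hang described here, the equivalent select() in
in.rlogind never returns even though login is blocked in write().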

Oh, the other process that's running on /dev/ttyp2 doesn't do anything
with the pty; it's busy with a select() on some sockets:

[picard:root](/proc/3120/fd)> strace -fp 780
oldselect(256, [3 5 6], NULL, NULL, {0, 970000}) = 0 (Timeout)
time(NULL) = 838114312
oldselect(256, [3 5 6], NULL, NULL, {1, 0}) = 0 (Timeout)
time(NULL) = 838114313
oldselect(256, [3 5 6], NULL, NULL, {1, 0} <unfinished ...>

However, every time I've observed this there was _always_ a second
process running on the ttyp?, so it must have something to do with it.

After killing both processes (the login and eggdrop things), the pty
works again. Killing only one of them will leave the pty blocked...

And thus far I cannot reproduce this; it just happens.

Oh, kernel 2.0.8

Mike.

-- 
  Miquel van    | Cistron Internet Services   --    Alphen aan den Rijn.
  Smoorenburg,  | mailto:info@cistron.nl          http://www.cistron.nl/
miquels@het.net | Tel: +31-172-419445 (Voice) 430979 (Fax) 442580 (Data)