Suspected mouse driver (gpm) or kernel bug (long)

F. Chastrette (francois.chastrette@ansf.alcatel.fr)
Mon, 28 Apr 1997 13:42:55 +0200


I am under the impression that gpm, or the kernel serial driver, has a
bug which creates problems for my serial mouse. This message is a (long)
description of my observations, intended for kernel/serial/gpm experts.
Some hopefully useful debug output is included at the end of this
message.

First the symptoms: _sometimes_ my mouse does not work at all
(the pointer does not move, clicking has no effect). So far one could
think that this is a configuration problem. However, the same mouse
works very well if I wait long enough (about 20 minutes). In the
meantime I do absolutely nothing except sit there and watch it. In other
words, the mouse suddenly becomes active after a very long time.
This is very strange.

One could suspect a hardware problem, but I don't think that this is the
case, for the reasons stated below. Besides, there have been a
few (perhaps 5 or 10) postings in cols from people who obviously know
how to configure a mouse and still can't get it to work. So ?

This problem happens in text mode (gpm), and also under X if gpm is
running. For some reason it does not happen under X if gpm is not
running, although a very similar, but non-blocking, problem occurs
even then (see below).

My configuration :
RedHat 4.0 (2.0.18 kernel, gpm version 1.10, XFree 3.2).
Matrox Millenium (I am using the SVGA server).
Microsoft serial mouse. It is a 2-button mouse, with a 9-pin serial
connector, connected to my cua0 (COM1). It has a label which
says "Microsoft serial mouse, version 2.1A".
Pentium 166, Micronics (Taiwanese) motherboard with 430HX chipset.
Award Plug-and-play BIOS.

My hardware runs fine under Windows 95, and I have no IRQ conflicts
(I have checked the various /proc devices, the Windows 95 configuration
panel, and done a warm-boot from Windows into Linux). I observed the
problem right after I installed Linux, before I got a chance to change
anything from the default settings (besides configuring the install).

I know other people who run Red Hat 4.0 with a serial mouse. Some have
the problem, others don't. Recent postings in the news talk about very
similar problems (someone said he must unplug, then plug back in, his
mouse to get it to work !). To me, this rules out a hardware problem
on my system.

My XF86Config says that my mouse uses the Microsoft protocol and is
on /dev/mouse, with 3-button emulation.

gpm is started by init (Red Hat, SysV-style init) as follows:
gpm -t Microsoft # and an implicit -m /dev/mouse

/dev/mouse is linked to the right device (/dev/cua0) (changing it to
/dev/ttyS0 yields the same symptoms). Moving the mouse and the
link to cua1 changes nothing.

When the problem is present, no action on the mouse (moving it,
clicking buttons) causes any visible effect : under X the cursor does
not move, in text mode the mouse cursor (reverse video rectangle) is
not visible. Menus do not work, nor does cut-and-paste, in short the
mouse is inactive. But when the mouse is finally recognized, everything
works well, until I restart X or gpm (e.g. when I reboot my machine).
Then the whole cycle starts over. To me, this is an indication that my
configuration is OK (if it was not the mouse would never become active).

Needless to say, I have tried this in all sorts of ways: I tried not
to move the mouse at all, I also tried to move it very fast and click
all over the place. I also held various buttons down at boot-time. No
changes.

I also tried to use the PS/2 port but could not get it to work with this
mouse. I tried two other mice, PS/2 or serial, and got them to work
on both ports (this is one of the reasons why I am confident that
my configuration is OK) (but the other mice did not use the
Microsoft protocol).

I have done some work to understand what is going on :
1. run gpm under debug (gdb)
2. run gpm under strace (as recommended by the author)
3. changed the source code in gpm, in order to get debug output.

Note that I did not feel like recompiling the X server to debug the
problem, so all of my observations refer to gpm. The X situation is
probably the same.

Here is what I found out:

1. if gpm is run under debug, and if I step to the beginning of the
main gpm loop, everything is fine, and my mouse is active as soon
as I let the program run freely. However the mouse does not
become active if I let the program run freely from the beginning.
Analysis : the problem is timing-dependent.

2. if I add debug output to gpm, the mouse does not work, and my log
shows that the select() call at the beginning of the main loop always
exits with result 0 (no pending data). This explains why
the mouse is not active.

3. if gpm is run under strace, the same select() finds waiting data,
and my mouse is active.

4. test-mouse detects which port my mouse is connected to, then asks
me to click a button (so it can recognize the protocol), then waits
forever. In other words it seems to have the same problem as gpm, BUT
it received something on that port.

5. gpm detects my mouse when the -T flag (test) is set. Of course it
is quite useless in this case, as it does not perform any actions
associated with the mouse. Again this seems to indicate that
the problem is timing-dependent.

6. when gpm is apparently stalled, ps shows that it is executing a
do_select() call. Quite consistent with observation 2.

I think that the 0 result returned by select() is wrong. For one thing, the
mouse generates data when I move it (I have checked this by doing
"cat /dev/mouse"). Besides, whenever it is reset (?), it outputs a string
of bytes, always the same sequence, which gpm apparently never
receives.

This is what the mouse outputs when it starts up :
$ cat < /dev/mouse > mouse.dmp
$ od -t x1 mouse.dmp
0000000 4d 40 00 00 08 01 24 30 2e 30 10 26 10 21 3c 10
0000020 10 10 15 10 11 10 11 3c 2d 2f 35 33 25 3c 30 2e
0000040 30 10 26 10 21 3c 2d 29 23 32 2f 33 2f 26 34 00
0000060 33 25 32 29 21 2c 00 2d 2f 35 33 25 00 12 0e 11
0000100 21 15 11 09 *** first move/click indication comes next
0000104
The mouse outputs this sequence whenever I do this cat command. I
recognize the first four bytes (M, followed by a data packet with no
buttons pressed, dx==dy==0), but not remainder of the initial sequence.
If I click buttons or move it, I get extra bytes, which match the
Microsoft protocol described in the man page ("man mouse").

strace gives the following output :
(...)
1567 23:28:26 listen(8, 5) = 0
1567 23:28:26 sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}) = 0
1567 23:28:26 oldselect(9, [7 8], NULL, NULL, {600, 0}) = 1 (in [7], left
{600, 0})
1567 23:28:26 open("/dev/tty0", O_RDONLY) = 9
1567 23:28:26 ioctl(9, KDGETMODE, 0xbffffd38) = 0
1567 23:28:26 close(9) = 0
1567 23:28:26 read(7, "3", 1) = 1
1567 23:28:26 oldselect(9, [7 8], NULL, NULL, {600, 0}) = 1 (in [7], left
{600, 0})
1567 23:28:26 open("/dev/tty0", O_RDONLY) = 9
1567 23:28:26 ioctl(9, KDGETMODE, 0xbffffd38) = 0
1567 23:28:26 close(9) = 0
1567 23:28:26 read(7, "%", 1) = 1
1567 23:28:26 oldselect(9, [7 8], NULL, NULL, {600, 0}) = 1 (in [7], left
{600, 0})
1567 23:28:26 open("/dev/tty0", O_RDONLY) = 9
1567 23:28:26 ioctl(9, KDGETMODE, 0xbffffd38) = 0
1567 23:28:26 close(9) = 0
1567 23:28:26 read(7, "2", 1) = 1
(...)
Note that gpm does inded receive data. It seems to start in the middle
of the mouse output (the exerpt shown here shows the first time around the
main gpm loop). The remainder of the log shows that the first
received byte, "3", is the 0x33 at offset 0x60 in the mouse data shown
above).

I changed the gpm code as follows (besides enabling debug output in
gpmCfg.h, and adding LOG calls in get_mouse_data):
In gpm.c:
/*--------------------------------------- main loop begins here */

while(1)
{
selSet=readySet;
resetTimeout();
if (opt_test) timeout.tv_sec=0;

if (eventFlag) /* an event left over by clustering */
{
pending=1;
FD_ZERO(&selSet);
FD_SET(mouse_table[eventFlag].fd,&selSet);
}
else
while ((pending=select(maxfd+1,&selSet,NULL_SET,NULL_SET,&timeout))==0)
{
LOG(("Select : no action")); *** MY EDIT ***
selSet=readySet;
resetTimeout();
} /* go on */

The corresponding output is a (very long!) file which begins:
13:04:42 - 223: Signed
13:14:44 - 223: Select : no action
13:24:44 - 223: Select : no action
13:34:44 - 223: Select : no action
13:44:44 - 223: Select : no action
13:54:44 - 223: Select : no action
14:04:44 - 223: Select : no action
14:14:44 - 223: Select : no action
14:24:44 - 223: Select : no action
14:34:44 - 223: Select : no action
14:44:44 - 223: Select : no action
14:54:44 - 223: Select : no action
15:04:44 - 223: Select : no action
15:14:44 - 223: Select : no action
15:24:44 - 223: Select : no action
15:34:44 - 223: Select : no action
15:44:44 - 223: Select : no action
15:54:44 - 223: Select : no action
16:04:44 - 223: Select : no action
The time-out in the select call is indeed 10 minutes. In this case my
mouse apparently never became active (I let it run for 10 hours).

However, if I use the -T flag, I get something different:
20:58:04 - 1223: Signed
20:58:04 - 1223: Select : no action
20:58:29 - 1223: selected 1 times
20:58:29 - 1223: Got mouse
20:58:29 - 1223: First byte 4d
20:58:29 - 1223: Read 02 bytes, first two are 40 4d *** my debug output ***
20:58:29 - 1223: Read 01 bytes, first two are 00 40 *** (not quite right) ***
20:58:29 - 1223: Skipping a data packet (?)
20:58:29 - 1223: Select : no action
20:58:29 - 1223: selected 1 times
20:58:29 - 1223: Got mouse
20:58:29 - 1223: First byte 00
20:58:29 - 1223: Error in protocol
20:58:29 - 1223: Select : no action
20:58:29 - 1223: selected 1 times
20:58:29 - 1223: Got mouse
20:58:29 - 1223: First byte 08
20:58:29 - 1223: Error in protocol
20:58:29 - 1223: Select : no action
20:58:29 - 1223: selected 1 times
(...)
eventually the mouse became active after reading the whole init
sequence (which it does not understand, as shown in the messages
"skipping a data packet").

This shows that every other select() returns no pending data, but
the next one returns the appropriate byte. With the -T flag,
the timeout in the first select in a pair is set to 0. The second
call has the usual timeout (10 minutes). Now, again, I think that this
is abnormal. Of course, I got this result without moving the mouse at all.

Now the X problem. I now kill gpm before I start X. Everything runs fine,
except for a small detail : when my X session starts, after I log in, the
whole thing freezes until I click the (emulated) middle button of my mouse.
The freeze occurs right after xdm erases the login widget, but before any
windows are visible on the screen. During the freeze the pointer moves along
with the mouse, but nothing else happens. Only the (emulated) middle button
works, the other two do nothing.

If I do a ps while the system is waiting, I get the following :
$ ps -lx
FLAGS UID PID PPID PRI NI SIZE RSS WCHAN STA TTY TIME COMMAND
100100 500 710 698 9 0 2348 788 do_select S ? 0:10 fvwm95-2

I think that the two problems are the same. ps(1) reports that gpm is also
executing a do_select while it is waiting forever.

My analysis:
1. my mouse outputs data that gpm is not expecting. This in itself may cause
problems.
2. the select() call does not seem to work correctly. There seem to be two
problems, which are probably related:
(a) select() believes that there is no data on the serial line, which is wrong
(b) a select() with a null timeout returns no data, however the same select()
immediately thereafter receives data. This is wrong too. Data is there or not,
it is not there 50% of the time.
3. the problems are time-dependent, as shown by the correct behavior of the
same program under gdb and strace.
4. this really seems to be a kernel bug, because the X mouse driver certainly
does not use the gpm code (or does it ?).
5. however the problem is more complicated than the serial line driver not
working at all. Is it the HX chipset ? Is it because the bit rate for a
mouse is very low (1200 bps)? Is it the Microsoft mouse?

Please don't say "get rid of the Microsoft mouse". I can do that, but this
would not correct the bug, would it?

An (otherwise) happy Linux user.

-- 
François Chastrette