Re: Freeze ups with Cyclades cards

G.W. Wettstein (greg@wind.rmcc.com)
Tue, 27 Jun 1995 13:25:52 CDT


On Jun 24, 6:27pm, Eric Schenk wrote:
} Subject: Freeze ups with Cyclades cards
>
> greg@wind.rmcc.com (G.W. Wettstein) writes:
> >Sorry to shout, IS THERE ANYONE ELSE IN THE WORLD (southern hemisphere
> > included.. :-) EXPERIENCING LOCK-UP PROBLEMS WITH THE
> > CYCLADES BOARDS?????
>
> I have just spent some 3 days fighting and solving a problem with
> freeze ups on the Cyclades board that may be related. I can't
> be sure because your original message did not describe the nature
> of the freeze ups in sufficient detail. I have been in contact with
> Marcio Saito <marcio@cyclades.com> at Cyclades, one of the two authors
> of the driver. He and Randolph Bentson have promised some changes
> in the next release of the driver due to the problems I reported.

... [ Description of problem deleted ] ...

> So, the temporary fix is to correct the busy wait loop above, and
> instrument the failure case with a printk so you can see if it is
> happening. The correct fix will have to involve a rewrite of the
> write_cy_cmd routine so that it no longer does busy waiting.
> Marcio reports that he and Ralph will be doing this for the next
> release of the driver.

Thank you Eric for a very informative commentary on your debugging
efforts with the Cyclades driver. Unfortunately this does not seem to
be the root of the problem at our site.

I made the change in write_cy_cmd so that the comparison after the
busy-wait loop checks for i == 100 rather than i == 10. I also put a
printk inside the failure clause; my sources now look like this:

    /* if the CCR never cleared, the previous command
       didn't finish within the "reasonable time" */
    if ( i == 100 ) {
        /*DEBUG*/
        restore_flags(flags);
        printk("Cyclades -- CCR didn't clear in reasonable time.");
        return (-1);
    }
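
For those following along, here is a minimal user-space sketch of the
point of that change. It is not the driver source. It assumes the
busy-wait loop is bounded at 100 iterations, and the names
wait_for_ccr_clear and CCR_POLL_LIMIT are hypothetical. An ordinary
byte stands in for the CCR register. The idea is only that the value
tested after the loop must be the same constant that bounds the loop;
otherwise a timeout would go unreported.

```c
#include <stdio.h>

/* Assumed loop bound; hypothetical name, chosen to match the "i == 100" test. */
#define CCR_POLL_LIMIT 100

/*
 * Poll a command register until it reads zero or the poll limit is hit.
 * Returns 0 if the register cleared, -1 if it never did.
 * 'ccr' stands in for the memory-mapped CCR; here it is plain memory.
 */
static int wait_for_ccr_clear(volatile unsigned char *ccr)
{
    int i;

    for (i = 0; i < CCR_POLL_LIMIT; i++) {
        if (*ccr == 0)
            break;
        /* a real driver would delay briefly between polls */
    }

    /* Test against the same constant that bounds the loop. */
    if (i == CCR_POLL_LIMIT) {
        fprintf(stderr, "CCR didn't clear in reasonable time.\n");
        return -1;
    }
    return 0;
}
```

Called on a byte holding zero this returns 0 at once; called on a byte
that never changes from a non-zero value it runs out the limit and
returns -1.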

I rebooted with the test kernel and put one of our users onto the
first port of the Cyclades card. The port locked up after about 20
minutes of use. There was no debug message in the kernel log files.

I dumped the status registers of the Cyclades ports right after boot.
The port in question had these initial values:

Jun 26 07:12:47 blizard kernel: klogd 1.2-pl4, log source = sys_syslog started.
NOTE: Kernel start, port status follows:
Jun 26 07:13:37 blizard kernel: show status line 0
Jun 26 07:13:37 blizard kernel: card 2, chip 0, channel 0
Jun 26 07:13:37 blizard kernel: cy_card
Jun 26 07:13:37 blizard kernel: irq base_addr num_chips first_line = 10 d4000 2 0
Jun 26 07:13:37 blizard kernel: cy_port
Jun 26 07:13:37 blizard kernel: card line flags = 2 0 c0000000
Jun 26 07:13:37 blizard kernel: *tty read_status_mask timeout xmit_fifo_size = 3bc000 ff 3 c
Jun 26 07:13:37 blizard kernel: cor1,cor2,cor3,cor4,cor5 = 3 20 9 0 0
Jun 26 07:13:37 blizard kernel: tbpr,tco,rbpr,rco = 81 1 81 1
Jun 26 07:13:37 blizard kernel: close_delay event count = 0 0 1
Jun 26 07:13:37 blizard kernel: x_char blocked_open = 0 0
Jun 26 07:13:37 blizard kernel: session pgrp open_wait = 3d 57 0
Jun 26 07:13:37 blizard kernel: CyGFRCR 46
Jun 26 07:13:37 blizard kernel: CyCAR c0
Jun 26 07:13:37 blizard kernel: CyGCR 0
Jun 26 07:13:37 blizard kernel: CySVRR 0
Jun 26 07:13:37 blizard kernel: CyRICR 0
Jun 26 07:13:37 blizard kernel: CyTICR 0
Jun 26 07:13:37 blizard kernel: CyMICR 0
Jun 26 07:13:37 blizard kernel: CyRIR 18
Jun 26 07:13:37 blizard kernel: CyTIR 10
Jun 26 07:13:37 blizard kernel: CyMIR a
Jun 26 07:13:37 blizard kernel: CyPPR f4
Jun 26 07:13:37 blizard kernel: CyRIVR 0
Jun 26 07:13:37 blizard kernel: CyTIVR 2
Jun 26 07:13:37 blizard kernel: CyMIVR 0
Jun 26 07:13:37 blizard kernel: CyMISR 0
Jun 26 07:13:37 blizard kernel: CyCCR 0
Jun 26 07:13:37 blizard kernel: CySRER 10
Jun 26 07:13:37 blizard kernel: CyCOR1 3
Jun 26 07:13:37 blizard kernel: CyCOR2 20
Jun 26 07:13:37 blizard kernel: CyCOR3 9
Jun 26 07:13:37 blizard kernel: CyCOR4 0
Jun 26 07:13:37 blizard kernel: CyCOR5 0
Jun 26 07:13:37 blizard kernel: CyCCSR 88
Jun 26 07:13:37 blizard kernel: CyRDCR 0
Jun 26 07:13:37 blizard kernel: CySCHR1 11
Jun 26 07:13:37 blizard kernel: CySCHR2 13
Jun 26 07:13:37 blizard kernel: CySCHR3 0
Jun 26 07:13:37 blizard kernel: CySCHR4 0
Jun 26 07:13:37 blizard kernel: CySCRL 0
Jun 26 07:13:37 blizard kernel: CySCRH 0
Jun 26 07:13:37 blizard kernel: CyLNC 0
Jun 26 07:13:37 blizard kernel: CyMCOR1 0
Jun 26 07:13:37 blizard kernel: CyMCOR2 0
Jun 26 07:13:37 blizard kernel: CyRTPR 2
Jun 26 07:13:37 blizard kernel: CyMSVR1 f3
Jun 26 07:13:37 blizard kernel: CyMSVR2 f3
Jun 26 07:13:37 blizard kernel: CyRBPR 51
Jun 26 07:13:37 blizard kernel: CyRCOR 21
Jun 26 07:13:37 blizard kernel: CyTBPR 51
Jun 26 07:13:37 blizard kernel: CyTCOR 9

After the port locked up I dumped the registers of port zero. Here
are the values after the lockup:

NOTE: port is reported as dead, status follows, note that only
NOTE: cub0 was used.
Jun 26 08:33:27 blizard kernel: show status line 0
Jun 26 08:33:27 blizard kernel: card 2, chip 0, channel 0
Jun 26 08:33:27 blizard kernel: cy_card
Jun 26 08:33:27 blizard kernel: irq base_addr num_chips first_line = 10 d4000 2 0
Jun 26 08:33:27 blizard kernel: cy_port
Jun 26 08:33:27 blizard kernel: card line flags = 2 0 c0000000
Jun 26 08:33:27 blizard kernel: *tty read_status_mask timeout xmit_fifo_size = 265000 ff 3 c
Jun 26 08:33:27 blizard kernel: cor1,cor2,cor3,cor4,cor5 = 3 20 9 0 0
Jun 26 08:33:27 blizard kernel: tbpr,tco,rbpr,rco = 81 1 81 1
Jun 26 08:33:27 blizard kernel: close_delay event count = 0 0 2
Jun 26 08:33:27 blizard kernel: x_char blocked_open = 0 0
Jun 26 08:33:27 blizard kernel: session pgrp open_wait = 88 99 0
Jun 26 08:33:27 blizard kernel: CyGFRCR e0
Jun 26 08:33:27 blizard kernel: CyCAR c0
Jun 26 08:33:27 blizard kernel: CyGCR e0
Jun 26 08:33:27 blizard kernel: CySVRR 0
Jun 26 08:33:27 blizard kernel: CyRICR e0
Jun 26 08:33:27 blizard kernel: CyTICR e0
Jun 26 08:33:27 blizard kernel: CyMICR e0
Jun 26 08:33:27 blizard kernel: CyRIR 18
Jun 26 08:33:27 blizard kernel: CyTIR 17
Jun 26 08:33:27 blizard kernel: CyMIR a
Jun 26 08:33:27 blizard kernel: CyPPR 1b
Jun 26 08:33:27 blizard kernel: CyRIVR e0
Jun 26 08:33:27 blizard kernel: CyTIVR e0
Jun 26 08:33:27 blizard kernel: CyMIVR e1
Jun 26 08:33:27 blizard kernel: CyMISR 0
Jun 26 08:33:27 blizard kernel: CyCCR 0
Jun 26 08:33:27 blizard kernel: CySRER e0
Jun 26 08:33:27 blizard kernel: CyCOR1 e0
Jun 26 08:33:27 blizard kernel: CyCOR2 e0
Jun 26 08:33:27 blizard kernel: CyCOR3 e0
Jun 26 08:33:27 blizard kernel: CyCOR4 e0
Jun 26 08:33:27 blizard kernel: CyCOR5 e0
Jun 26 08:33:27 blizard kernel: CyCCSR 0
Jun 26 08:33:27 blizard kernel: CyRDCR 0
Jun 26 08:33:27 blizard kernel: CySCHR1 e0
Jun 26 08:33:27 blizard kernel: CySCHR2 e0
Jun 26 08:33:27 blizard kernel: CySCHR3 e0
Jun 26 08:33:27 blizard kernel: CySCHR4 e0
Jun 26 08:33:27 blizard kernel: CySCRL e0
Jun 26 08:33:27 blizard kernel: CySCRH e0
Jun 26 08:33:27 blizard kernel: CyLNC e0
Jun 26 08:33:27 blizard kernel: CyMCOR1 e0
Jun 26 08:33:27 blizard kernel: CyMCOR2 e0
Jun 26 08:33:27 blizard kernel: CyRTPR e0
Jun 26 08:33:27 blizard kernel: CyMSVR1 70
Jun 26 08:33:27 blizard kernel: CyMSVR2 70
Jun 26 08:33:27 blizard kernel: CyRBPR a
Jun 26 08:33:27 blizard kernel: CyRCOR 3
Jun 26 08:33:27 blizard kernel: CyTBPR 40
Jun 26 08:33:27 blizard kernel: CyTCOR 2
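
To make the two dumps easier to compare, here is a small sketch that
holds a handful of the register values from the logs above as
before/after pairs and counts how many changed and how many read a
given value after the lock-up. The struct and function names are
hypothetical, only a subset of the registers is included, and the
code makes no claim about what the e0 readings mean.

```c
#include <stdio.h>

/* One register, sampled right after boot and again after the lock-up. */
struct reg_sample {
    const char *name;
    unsigned int before;
    unsigned int after;
};

/* Subset of the values taken from the two dumps in this message. */
static const struct reg_sample samples[] = {
    { "CyGFRCR", 0x46, 0xe0 },
    { "CyCAR",   0xc0, 0xc0 },
    { "CyGCR",   0x00, 0xe0 },
    { "CyCCR",   0x00, 0x00 },
    { "CySRER",  0x10, 0xe0 },
    { "CyCOR1",  0x03, 0xe0 },
    { "CyCCSR",  0x88, 0x00 },
    { "CyMSVR1", 0xf3, 0x70 },
};
#define NUM_SAMPLES (sizeof(samples) / sizeof(samples[0]))

/* Count registers whose value differs between the two dumps. */
static size_t count_changed(const struct reg_sample *r, size_t n)
{
    size_t i, changed = 0;

    for (i = 0; i < n; i++)
        if (r[i].before != r[i].after)
            changed++;
    return changed;
}

/* Count registers that read 'value' in the second (post-lock-up) dump. */
static size_t count_after_equal(const struct reg_sample *r, size_t n,
                                unsigned int value)
{
    size_t i, hits = 0;

    for (i = 0; i < n; i++)
        if (r[i].after == value)
            hits++;
    return hits;
}
```

Extending the table to the full register list would show at a glance
which registers kept their boot-time values and which did not.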

I should probably explain our configuration a little more to help
those following along at home. It may also be helpful to Alan, who
commented on the kernel list about the possibility of the 1.3.x
networking code interacting with the Cyclades driver.

First of all I should state that the networking code is NOT affected
by the Cyclades driver. No matter what happens to the Cyclades card,
networking is unaffected. This is notwithstanding the oops that we
got last weekend. I have noticed that this is something that others
have experienced as well, so it is probably not related to the
Cyclades driver.

We have in essence built a terminal server out of one of our old
80386dx boxes. The corporate scheduling and patient database software
runs on an ES-9000 IBM mainframe. We have installed an AEA card in
one of the VTAM controllers on this mainframe. This AEA card
basically does protocol conversion between 3270 and vt100. Out of the
AEA card come RS-232 lines carrying mainframe login sessions encoded
with vt100 control sequences.

The serial lines run into the Cyclades card on the 80386. When our
users want a mainframe session they click on the appropriate menu
option of their software. This software (which runs the Cancer
Center) is X-based and runs on Linux-based X workstations throughout
the Cancer Center.

Requesting the mainframe session causes a terminal emulator to be run
on the 80386 acting as the terminal server. The terminal emulator is
run inside an xterm which has its display variable pointed back to the
workstation requesting the interactive session. The emulator performs
the login sequence and the user now has a mainframe session running on
their X display.
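
For anyone trying to emulate the setup, the session start-up described
above could be sketched roughly as follows. This is an assumption
about the shape of the command, not our actual launcher: the function
name, the emulator program and the serial device are placeholders
supplied by the caller, and display number 0 on the workstation is
assumed. The only xterm options used are -display and -e.

```c
#include <stdio.h>

/* Number of real arguments in the vector; a NULL terminator follows. */
#define XTERM_ARGC 6

/*
 * Build an argument vector of the form
 *     xterm -display <workstation>:0 -e <emulator> <serial_dev>
 * 'display_buf' receives the "<workstation>:0" string and must outlive
 * the vector. Returns 0 on success, -1 if the buffer is too small.
 * All names here are hypothetical.
 */
static int build_xterm_argv(const char *workstation, const char *emulator,
                            const char *serial_dev,
                            char *display_buf, size_t display_len,
                            const char *argv[XTERM_ARGC + 1])
{
    int n = snprintf(display_buf, display_len, "%s:0", workstation);

    if (n < 0 || (size_t)n >= display_len)
        return -1;

    argv[0] = "xterm";
    argv[1] = "-display";
    argv[2] = display_buf;   /* display on the requesting workstation */
    argv[3] = "-e";
    argv[4] = emulator;      /* terminal emulator run inside the xterm */
    argv[5] = serial_dev;    /* Cyclades port the emulator should open */
    argv[6] = NULL;
    return 0;
}
```

A launcher on the terminal server would then fork and hand the vector
to execvp(), so that the serial traffic stays on the 80386 while the X
traffic goes out over the network to the workstation.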

This technique does have the side effect of generating simultaneous
network device activity and Cyclades activity. After initiating a
login session a user will typically work for 5-20 minutes before the
Cyclades port locks up in the previously documented fashion. When the
lock-up occurs the results are variable. Sometimes all ports on the
controller chip are locked; in this case the second controller chip
on the board will function, but it too will lock up. In other cases
locking up one port will lock up the entire controller chip.

These login sessions are ultimately 3270 page-mode style interactions.
I mention this because my users report that the lock-ups tend to
occur at the same point. Selecting a menu item in the mainframe
software typically causes a full-screen refresh of the data. The
users report that the lock-up invariably occurs at the beginning of a
screen refresh. This behavior is what caused me to begin thinking
that the problem may be related to concomitant network and Cyclades
activity.

Hopefully this summary and the debugging information will be of some
use to Eric, Randolph or Marcio. We are bringing a major software
release on-line and I have been so swamped that I have been unable to
take a day or two and track this down.

If anyone would like additional information please feel free to
contact me. With the above description of our application there may
be Linux users out there who could emulate our setup and see if the
freezes can be re-created elsewhere.

My pager is going off for about the umpteenth time. Thanks to
everyone for their interest in this problem.

> -- eric

}-- End of excerpt from Eric Schenk

As always,
Dr. G.W. Wettstein Oncology Research Div. Computing Facility
Roger Maris Cancer Center INTERNET: greg@wind.rmcc.com
820 4th St. N.
Fargo, ND 58122
Phone: 701-234-7556
----------------------------------------------------------------------
`The truest mark of a man's wisdom is his ability to listen to other
men expound their wisdom.' -- GWW