RE: SMP 2.2.15pre13 unstable on Dell PE1300 - aic7xxx related?

From: Tony Scholes (tonys@beacon.co.uk)
Date: Fri Mar 17 2000 - 05:32:17 EST


On Tuesday, March 07, 2000 8:00 PM, David L. Parsley (lkml account)
[SMTP:kparse@salem.k12.va.us] wrote:
> Hi Alan,
>
> On Mon, 6 Mar 2000, Alan Cox wrote:
>
> > The memory/NMI stuff bothers me. That is normally a parity or bus error
> > (does the box have ecc/parity ram ?)
>
> I counted 9 chips on the DIMMS, so I'm assuming ECC. Strangest looking
> DIMM I've ever seen, though; it looks like 2 DIMM's sandwiched together
> with one of them plugged in. So, 256M RAM in 1 DIMM, with 2x2x9=36 chips.
>
> I also noted something in my other replies I forgot to mention before; the
> box is solid for weeks with a UP kernel, but 2-3 days SMP.
>

David

May be unrelated but...

We had a similar problem with SMP on Dell PowerEdges, which was solved (we
think) by Red Hat patches to the kernel.

The stuff below may be useful.....

Since then a 2.2.12-22 rpm of the kernel has been installed, which is supposed
to solve the problem with iBCS (if there was one).

> I'll run memtest86.
>
> regards,
> David
>
> - --
> David L. Parsley
> Network Administrator
> Roanoke College
>
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.rutgers.edu
> Please read the FAQ at http://www.tux.org/lkml/

--
Tony Scholes
Technical Manager
=================================================
  Beacon Computer Services                      Tel: +44 (0)1582 478888
  The Friars, 82 High Street South           Fax: +44 (0)1582 478810
  Dunstable, Beds. UK                          mailto: tonys@beacon.co.uk
  LU6 3HD                                        Compuserve: 72660,207
=================================================
"Have you seen junior's grades?"

-----Original Message-----
From: Tim Waugh [mailto:twaugh@redhat.com]
Sent: 23 February 2000 19:23
To: malcolm@metalfast.co.uk
Subject: Kernel upgrade

Hi Malcolm,

Here's what you need to do. The kernel RPM is at:

<URL:http://people.redhat.com/RPMS/kernel-smp-2.2.12-21.i386.rpm>

Get that onto metnew and do this, as root:

# rpm -ivh kernel-smp-2.2.12-21.i386.rpm

Next you need to make an initial ramdisk. Do this:

# cd /boot
# mkinitrd --with megaraid --with aic7xxx initrd-2.2.12-21smp.img 2.2.12-21smp

Then you need to add these lines to /etc/lilo.conf, at the end:

image=/boot/vmlinuz-2.2.12-21smp
        label=linux-2.2.12-21smp
        root=/dev/sda6
        initrd=/boot/initrd-2.2.12-21smp.img
        read-only

Run lilo, and reboot. At the LILO prompt, select 'linux-2.2.12-21smp'.
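
Before running lilo it can help to confirm that the kernel image and the initial ramdisk named in the new stanza really exist, so that a typo in either file name is caught early. The following is only a minimal sketch, assuming a POSIX shell; the helper name check_boot_files is hypothetical and is not part of the RPM, mkinitrd or lilo:

```shell
# Hypothetical helper (not part of any package): succeed only if both
# files referenced by the lilo.conf stanza exist in the given directory.
check_boot_files() {
    boot_dir=$1   # directory holding the kernel, e.g. /boot
    version=$2    # kernel version suffix, e.g. 2.2.12-21smp

    [ -f "$boot_dir/vmlinuz-$version" ] || {
        echo "missing kernel image: $boot_dir/vmlinuz-$version" >&2
        return 1
    }
    [ -f "$boot_dir/initrd-$version.img" ] || {
        echo "missing initrd: $boot_dir/initrd-$version.img" >&2
        return 1
    }
    echo "ok"
}
```

With the file names used above, it would be called as "check_boot_files /boot 2.2.12-21smp" before running lilo.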

I think the problem is that the fput/fget file operations are involved in a race condition. This is happening because some module (probably iBCS) isn't taking a lock when it should.

This patched kernel makes fput/fget (more) safe against this race, and so should drastically decrease the frequency of these oopses; however, they may still happen, because the real bug (not taking the lock at the appropriate point) is not addressed. I'm currently looking at an iBCS patch to see if it is likely to solve the real problem. If I think it will I'll make you another kernel.

If my analysis is correct, you would also find that selecting 'linux-up' at the LILO prompt would make these oopses vanish altogether. If the oopses continue with the RPM I've built, please try linux-up to verify that this is an SMP-related bug. You will of course only have the use of one processor while the linux-up kernel is running.
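
When switching between the two LILO entries it is easy to lose track of which kernel is actually running. On Linux, SMP builds normally carry an "SMP" tag in the version string printed by "uname -v", so a rough check is possible from the shell. This is a sketch under that assumption; the helper name kernel_flavour is hypothetical:

```shell
# Hypothetical helper: classify a kernel version string (as printed by
# `uname -v`) as SMP or uniprocessor by looking for the "SMP" word.
kernel_flavour() {
    # Pad with spaces so "SMP" is only matched as a whole word.
    case " $1 " in
        *" SMP "*) echo "smp" ;;
        *)         echo "up" ;;
    esac
}
```

Running kernel_flavour "$(uname -v)" after a reboot should report "up" under the linux-up entry and "smp" under the SMP one, assuming the tag is present in the build.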

Please note that due to lack of hardware and time, I haven't verified that this kernel actually works; I will do tomorrow morning.

Let me know if you have any questions.

Tim.




This archive was generated by hypermail 2b29 : Thu Mar 23 2000 - 21:00:21 EST