Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered

From: Ross Dickson
Date: Thu Dec 11 2003 - 01:42:33 EST


I am trying to draw AMD into the picture using their ask AMD web form
but I think it is broken.

Could Asia support please take this seriously and forward it to the appropriate
AMD technical personnel? I believe the issue is not restricted to Linux but
applies to any code which executes the same way.

The ask AMD submission follows:

Subject Details:
Possible CPU ERRATA: re: bus disconnect and apic timer interrupt

Greetings:
Many others and I have been tracking down a hard lockup problem on Linux with
the nforce2 chipset.

Please find continuing discussion including a copy of this submission here:

http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/1528.html

My current level of knowledge (best estimate) on the problem is this: if a cpu
disconnect cycle is in progress or has occurred, and the local apic timer interrupt
is the trigger to return to a connected state, then an undocumented timing
constraint exists. The constraint is that the local apic acknowledge will not be
correctly received by the local apic if it occurs earlier than about 500us
after the processor continues execution. That is, if the processor issues
the ack earlier than 500us after resuming execution, an unrecoverable
hard lockup of the system occurs.
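
To make the suspected constraint concrete, here is a minimal sketch in C of
the check a workaround would want to make before acknowledging. It is only an
illustration under assumptions: ack_delay_needed_ns() is a hypothetical helper,
not a kernel or AMD interface, the minimum gap is passed in by the caller rather
than fixed, and it assumes software can learn when execution resumed, which is
not established.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): given the time at which the
 * processor resumed execution after a disconnect and the current time,
 * return how many nanoseconds still have to pass before the local apic
 * acknowledge is assumed to be safe. Returns 0 if the minimum gap has
 * already elapsed.
 *
 * min_gap_ns is an assumption supplied by the caller; the real limit,
 * if one exists, is undocumented.
 */
static inline uint64_t ack_delay_needed_ns(uint64_t resumed_at_ns,
                                           uint64_t now_ns,
                                           uint64_t min_gap_ns)
{
    /* Timestamps are expected to come from one monotonic clock. */
    assert(now_ns >= resumed_at_ns);

    uint64_t elapsed_ns = now_ns - resumed_at_ns;

    if (elapsed_ns >= min_gap_ns)
        return 0;                    /* safe to ACK immediately */

    return min_gap_ns - elapsed_ns;  /* wait this long, then ACK */
}
```

A caller would busy-wait for the returned number of nanoseconds and only then
issue the acknowledge. Without a way to detect the resume time, the fallback
is to wait for the full minimum gap on every local apic timer interrupt.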

Possible causes include a slow start of the local system bus relative to
the reconnection of the cpu to the local apic, as per earlier model Athlon CPUs?
Or system bus connect/disconnect signal timing problems with the nforce2 northbridge?

What I would like to know is:

a) Can you please isolate and verify the cause, assuming you have hardware testing facilities?

b) Does this problem affect all local apic interrupt sources, including those which
have come from an io-apic?

c) Is there a chipset-independent way of finding out if we are coming out of
a disconnect state prior to issuing the local apic acknowledge?
i.e. is there a readable status bit within the processor that we can use to see
if it is safe to immediately ACK the local apic or if we should wait for 500ns or so?

I have experienced this problem on XP2500 barton and XP2200 thoroughbred cores.
Others have experienced it on other model barton cores. At least 4 makes of
motherboard are involved.

So far it appears to affect all current and pending Linux releases for the nforce2 chipsets.
One could say this relates to a good quantity of potential AMD Athlon cpu sales,
and bugs with nforce2 and AMD may sour uptake of nforce3 and x64.

Regards
Ross Dickson.
Director.
Dat's Creative Pty Ltd
Gold Coast
Australia


I don't know if it got through; I received this after pressing the submit button:

The page cannot be displayed
There is a problem with the page you are trying to reach and it cannot be displayed.
Please try the following:
Click the Refresh button, or try again later; it does not normally take a long time for an application to restart.
Open the 139.95.253.214 home page, and then look for links to the information you want.
HTTP Error 500-12 Application Restarting
Internet Information Services
Technical Information (for support personnel)

Background:
The request cannot be processed while the Web site is restarting.

More information:
Microsoft Support
