Re: [patch] SMP race fix [was Re: SMP lockup & 3c509 on 2.2.x [aka.

Andrea Arcangeli (andrea@e-mind.com)
Wed, 5 May 1999 12:26:26 +0200 (CEST)


I think I seen the race that maybe was the source of the popular SMP
network lockup with code like: disable_irq(); spin_lock(); ...
spin_unlock(); enable_irq().

As first here it is my untested (I can't reproduce here) fix that will
apply cleanly to 2.2.7:

Index: arch/i386/kernel/irq.c
===================================================================
RCS file: /var/cvs/linux/arch/i386/kernel/irq.c,v
retrieving revision 1.1.2.24
diff -u -r1.1.2.24 irq.c
--- linux/arch/i386/kernel/irq.c 1999/05/04 15:56:32 1.1.2.24
+++ linux/arch/i386/kernel/irq.c 1999/05/05 09:43:20
@@ -767,7 +767,7 @@
}
spin_unlock_irqrestore(&irq_controller_lock, flags);

- if (irq_desc[irq].status & IRQ_INPROGRESS)
+ while (irq_desc[irq].status & IRQ_INPROGRESS)
synchronize_irq();
}

My point is that the irq could be INPROGRESS (it could be running a bit
before the call to handle_irq_event()), but if so it could have not yet
issued a irq_enter() that would have increased global_irq_count. So the
global_irq_count could be still 0 at that time synchronize_irq(). So
synchronize_irq would be a nono even if the irq is going to run anyway.
And so we would continue in the no-irq path (after the call to
disable_irq) even if there was an irq in progress. And this in turn mean
that after the disable_irq() we could issue an enable_irq() while the irq
is still running, and doing that we would clear the INPROGRESS bit
allowing other level-apic irq to run nested on the first irq. Subtle? ;)

With my patch the disable_irq() path will stop until the IRQ_INPROGRESS
will be expired for real and even if this wouldn't be the cause of the
lockup it looks strictly needed to me.

I would like if people that is experiencing SMP lockups could try out my
patch above and feedback ;).

BTW the synchronize_irq() call after disable_irq() in 8390.c are really
not needed ;).

Index: /usr/src/linux/drivers/net/8390.c
===================================================================
RCS file: /var/cvs/linux/drivers/net/8390.c,v
retrieving revision 1.1.2.2
diff -u -r1.1.2.2 8390.c
--- linux/drivers/net/8390.c 1999/03/09 01:23:27 1.1.2.2
+++ linux/drivers/net/8390.c 1999/05/05 10:23:16
@@ -258,7 +258,6 @@
/* Ugly but a reset can be slow, yet must be protected */

disable_irq(dev->irq);
- synchronize_irq();
spin_lock(&ei_local->page_lock);

/* Try to restart the card. Perhaps the user has fixed something. */
@@ -287,7 +286,6 @@
*/

disable_irq(dev->irq);
- synchronize_irq();

spin_lock(&ei_local->page_lock);

Andrea Arcangeli

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/