Re: 2.4.24 SMP lockups

From: Marcelo Tosatti
Date: Wed Jan 14 2004 - 13:15:02 EST




On Wed, 14 Jan 2004, Simon Kirby wrote:

> On Sat, Jan 10, 2004 at 05:32:55PM -0200, Marcelo Tosatti wrote:
>
> > This sounds like a deadlock. I wonder why the NMI watchdog is not
> > triggering.
>
> Well, with the NMI watchdog working (nmi_watchdog=2), we just had another
> occurrence. This time, I had the serial console ready. :)
>
> I'm guessing this is the same as the previous cases; however, this time
> sysrq-P was able to print information from both CPUs. I assume the NMI
> watchdog unlocked interrupts from what would have been the stuck CPU?
>
> NMI Watchdog detected LOCKUP on CPU0, eip c011c7cb, registers:
> CPU: 0
> EIP: 0010:[<c011c7cb>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00000086
> eax: ddadf5d0 ebx: d8a2e000 ecx: 00000000 edx: d8a2fe50
> esi: d8a2fe50 edi: 00000286 ebp: 00020690 esp: d8a2fe30
> ds: 0018 es: 0018 ss: 0018
> Process php4 (pid: 19197, stackpage=d8a2f000)
> Stack: d8a2e000 d8a2fe50 ddadf5d0 c015a8e4 00000000 d8a2e000 00000000 00000000
> 00000000 d8a2e000 ddadf5d4 ddadf5d4 ddadf520 ddadf520 c1ce4178 c015b40b
> ddadf520 0000c82f 00000018 0000ffff c1ce4178 00020690 f7b73c00 c015b881
> Call Trace: [<c015a8e4>] [<c015b40b>] [<c015b881>] [<c0176e68>] [<c014e792>]
> [<c014ec7c>] [<c014f259>] [<c014f81e>] [<c01418ce>] [<c0141cf3>] [<c010926f>]
> Code: f3 90 7e f9 e9 8d e9 ff ff 80 3d c0 a3 31 c0 00 f3 90 7e f5
>
> >>EIP; c011c7ca <.text.lock.fork+1a/120> <=====
> Trace; c015a8e4 <__wait_on_freeing_inode+74/a0>
> Trace; c015b40a <find_inode+6a/80>
> Trace; c015b880 <iget4+60/110>
> Trace; c0176e68 <ext3_lookup+78/a0>
> Trace; c014e792 <real_lookup+f2/140>
> Trace; c014ec7c <link_path_walk+31c/6f0>
> Trace; c014f258 <path_lookup+38/40>
> Trace; c014f81e <open_namei+6e/690>
> Trace; c01418ce <filp_open+3e/70>
> Trace; c0141cf2 <sys_open+52/c0>
> Trace; c010926e <system_call+32/38>

Thanks so much for this Simon.

I'm not still sure why it is deadlocking. David Woodhouse and myself are
taking a closer look.

Anyway, please revert the attached patch and retry. It removes the
"__wait_on_freeing_inode" logic.
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.1136.66.2 -> 1.1136.67.1
# fs/inode.c 1.41 -> 1.42
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/11/15 cattelan@xxxxxxxxxxxxxxxx 1.1199
# Merge lips.thebarn.com:/export/hose/bkroot/linux-2.4
# into lips.thebarn.com:/export/hose/bkroot/linux-2.4+justXFS
# --------------------------------------------
# 03/11/16 pmeda@xxxxxxxxxx 1.1136.66.3
# [netdrvr tulip] fix hashed setup frame code
#
# It is using local variable `i' in both the inner and outer loop.
#
# Need to bring the for loop outside the loop. Otherwise we need to reset the
# setup_frame to tp->setup_frame after every loop. You do not need to set the
# setup_frm for every mc address, we can set once after the complete has_table
# is ready.
# --------------------------------------------
# 03/11/17 livio@xxxxxxxxxx 1.1136.67.1
# [PATCH] Backport inode_hash race fix
#
# Hello,
#
# After trying to "get around" the inode_hash races when removing and
# iget()ing the same inode, my code got really ugly, and I got fed up. So
# yesterday I got Neil's 2.5 patch and backportted it to 2.4.22-rc2.
#
# The patch is very similar to Neil's, except for one (very important to
# me) case. Neil's patch only covered the removal of inodes in
# generic_delete_inode() (which in 2.4 is the case where i_nlink is zero in
# iput()). But, as I described in a previous post:
# http://marc.theaimsgroup.com/?l=linux-fsdevel&m=105547595519745&w=2
#
# , I frequently get busted in prune_icache(). In Neil's patch prune_icache
# is not covered. In my opinion, this case (in prune_icache()), has to be
# fixed in 2.6 too. Depending on your comments, I may make a patch for 2.6
# later.
#
# Please comment if you can, so that I may send this to Marcelo then
# 2.4.23-pre opens (which should be soon, I think).
#
# best regards,
#
# --
# Livio B. Soares
# --------------------------------------------
#
diff -Nru a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c Wed Jan 14 15:52:23 2004
+++ b/fs/inode.c Wed Jan 14 15:52:23 2004
@@ -206,7 +206,8 @@
if ((inode->i_state & flags) != flags) {
inode->i_state |= flags;
/* Only add valid (ie hashed) inodes to the dirty list */
- if (!(inode->i_state & I_LOCK) && !list_empty(&inode->i_hash)) {
+ if (!(inode->i_state & (I_LOCK|I_FREEING|I_CLEAR)) &&
+ !list_empty(&inode->i_hash)) {
list_del(&inode->i_list);
list_add(&inode->i_list, &sb->s_dirty);
}
@@ -235,6 +236,30 @@
__wait_on_inode(inode);
}

+/*
+ * If we try to find an inode in the inode hash while it is being deleted, we
+ * have to wait until the filesystem completes its deletion before reporting
+ * that it isn't found. This is because iget will immediately call
+ * ->read_inode, and we want to be sure that evidence of the deletion is found
+ * by ->read_inode.
+ *
+ * This call might return early if an inode which shares the waitq is woken up.
+ * This is most easily handled by the caller which will loop around again
+ * looking for the inode.
+ *
+ * This is called with inode_lock held.
+ */
+static void __wait_on_freeing_inode(struct inode *inode)
+{
+ DECLARE_WAITQUEUE(wait, current);
+
+ add_wait_queue(&inode->i_wait, &wait);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ spin_unlock(&inode_lock);
+ schedule();
+ remove_wait_queue(&inode->i_wait, &wait);
+ spin_lock(&inode_lock);
+}

static inline void write_inode(struct inode *inode, int sync)
{
@@ -596,6 +621,11 @@
if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
+ spin_lock(&inode_lock);
+ list_del(&inode->i_hash);
+ INIT_LIST_HEAD(&inode->i_hash);
+ spin_unlock(&inode_lock);
+ wake_up(&inode->i_wait);
destroy_inode(inode);
nr_disposed++;
}
@@ -707,6 +737,14 @@
*
* We don't expect to have to call this very often.
*
+ * We leave the inode in the inode hash table until *after*
+ * the filesystem's ->delete_inode (in dispose_list) completes.
+ * This ensures that an iget (such as nfsd might instigate) will
+ * always find up-to-date information either in the hash or on disk.
+ *
+ * I_FREEING is set so that no-one will take a new reference
+ * to the inode while it is being deleted.
+ *
* N.B. The spinlock is released during the call to
* dispose_list.
*/
@@ -739,8 +777,6 @@
if (atomic_read(&inode->i_count))
continue;
list_del(tmp);
- list_del(&inode->i_hash);
- INIT_LIST_HEAD(&inode->i_hash);
list_add(tmp, freeable);
inode->i_state |= I_FREEING;
count++;
@@ -793,6 +829,7 @@
struct list_head *tmp;
struct inode * inode;

+repeat:
tmp = head;
for (;;) {
tmp = tmp->next;
@@ -806,6 +843,10 @@
continue;
if (find_actor && !find_actor(inode, ino, opaque))
continue;
+ if (inode->i_state & (I_FREEING|I_CLEAR)) {
+ __wait_on_freeing_inode(inode);
+ goto repeat;
+ }
break;
}
return inode;
@@ -1076,8 +1117,6 @@
return;

if (!inode->i_nlink) {
- list_del(&inode->i_hash);
- INIT_LIST_HEAD(&inode->i_hash);
list_del(&inode->i_list);
INIT_LIST_HEAD(&inode->i_list);
inode->i_state|=I_FREEING;
@@ -1095,6 +1134,11 @@
delete(inode);
} else
clear_inode(inode);
+ spin_lock(&inode_lock);
+ list_del(&inode->i_hash);
+ INIT_LIST_HEAD(&inode->i_hash);
+ spin_unlock(&inode_lock);
+ wake_up(&inode->i_wait);
if (inode->i_state != I_CLEAR)
BUG();
} else {