Re: ext2 filesystem corruption?!?!??

Mark Hemment (markhe@nextd.demon.co.uk)
Mon, 31 Mar 1997 10:27:22 +0000 (GMT)


On Sun, 30 Mar 1997, Jeff Garzik wrote:

> I am running a Usenet news server, and it has several ext2 filesystems
> spread across several SCSI disks and controllers. When the load reaches
> such that one of the news processes is always in disk wait, the on-disk
> info starts getting corrupted. The free list is the first to go,
> followed by general inode chaos. e2fsck often has to restart because it
> finds so many errors.

I might have found one of the causes for this corruption.

In fs/inode.c, truncate_inode_pages() is called from clear_inode() (to
remove pages from the named-page cache). If there are locked pages in the
inode->i_pages list, then the truncate operation will sleep (the locked
pages probably come from read-ahead).

While truncate_inode_pages() is sleeping, __iget() can be called from
another task for the inode which is being cleared. That task gets the
inode, which is then zero-filled when the truncate operation completes.
Alternatively, while truncate_inode_pages() is sleeping, get_empty_inode()
can be called in another context and select the same inode!
This will most likely happen under heavy load.

Below is a very simple patch. It removes the inode from the hash and free
lists _before_ truncating (as the named-page cache is indexed via the
in-core inode address, we shouldn't have any races in the cache itself).
The best fix is to totally re-write inode.c...
The patch was made on a 2.1.30 tree, but should apply to earlier cuts.
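
To illustrate why the ordering matters, here is a minimal user-space sketch
(not kernel code, and not part of the patch). It is a deterministic,
single-threaded simulation: the "sleep" in truncation is modelled by simply
running the other task's lookup at that point. All names (toy_inode,
toy_iget(), toy_truncate(), toy_unhash(), toy_clear_inode(),
simulate_clear()) are hypothetical stand-ins, and only the hash-lookup race
is modelled, not the free-list one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an in-core inode; NOT the kernel's struct inode. */
struct toy_inode {
    unsigned long i_ino;
    int i_count;       /* reference count */
    int hashed;        /* 1 while reachable through the toy "hash" */
    int locked_pages;  /* 1 if truncation would have to sleep */
};

static struct toy_inode *hash_slot;           /* one-entry "hash table" */
static struct toy_inode *seen_by_other_task;  /* what the other task got */

/* Models a lookup from another task: find the inode and take a reference. */
static struct toy_inode *toy_iget(void)
{
    if (hash_slot && hash_slot->hashed) {
        hash_slot->i_count++;
        return hash_slot;
    }
    return NULL;
}

/* Models truncation: with locked pages it "sleeps", and the other
 * task's lookup runs during that sleep. */
static void toy_truncate(struct toy_inode *inode)
{
    if (inode->locked_pages) {
        seen_by_other_task = toy_iget();  /* other task runs here */
        inode->locked_pages = 0;          /* pages unlocked, we wake up */
    }
}

/* Models removing the inode from the hash list. */
static void toy_unhash(struct toy_inode *inode)
{
    inode->hashed = 0;
    if (hash_slot == inode)
        hash_slot = NULL;
}

/* unhash_first == 0: old ordering (truncate, then unhash).
 * unhash_first == 1: patched ordering (unhash, then truncate). */
static void toy_clear_inode(struct toy_inode *inode, int unhash_first)
{
    assert(inode != NULL);
    if (unhash_first)
        toy_unhash(inode);
    toy_truncate(inode);
    if (!unhash_first)
        toy_unhash(inode);
    memset(inode, 0, sizeof(*inode));  /* the zero-fill at the end */
}

/* Returns 1 if the other task obtained the inode that was then zero-filled. */
int simulate_clear(int unhash_first)
{
    struct toy_inode inode = { 42, 0, 1, 1 };

    hash_slot = &inode;
    seen_by_other_task = NULL;
    toy_clear_inode(&inode, unhash_first);
    return seen_by_other_task == &inode;
}
```

With the old ordering, simulate_clear(0) reports that the other task got hold
of the inode before it was zero-filled; with the patched ordering,
simulate_clear(1) reports that the lookup found nothing.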

Regards,

markhe

------------------------------------------------------------------
Mark Hemment, Unix/C Software Engineer (Contractor)
markhe@nextd.demon.co.uk http://www.demon.co.uk/
"Success has many fathers, failure is a B**TARD!" - anon
------------------------------------------------------------------

--- linux-2.1.30/fs/inode.c Tue Mar 18 01:18:41 1997
+++ slab.1.30/fs/inode.c Mon Mar 31 10:03:48 1997
@@ -173,14 +173,15 @@
 {
 	struct wait_queue * wait;
 
-	truncate_inode_pages(inode, 0);
+	remove_inode_hash(inode);
+	remove_inode_free(inode);
+	if (inode->i_pages)
+		truncate_inode_pages(inode, 0);
 	wait_on_inode(inode);
 	if (IS_WRITABLE(inode)) {
 		if (inode->i_sb && inode->i_sb->dq_op)
 			inode->i_sb->dq_op->drop(inode);
 	}
-	remove_inode_hash(inode);
-	remove_inode_free(inode);
 	wait = ((volatile struct inode *) inode)->i_wait;
 	if (inode->i_count)
 		nr_free_inodes++;