Re: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal

From: Andrew Morton
Date: Thu Sep 16 2004 - 16:49:39 EST


Seiji Kihara <kihara.seiji@xxxxxxxxxxxxx> wrote:
>
> We found that fsync and fdatasync syscalls sometimes don't sync
> data in an ext3 file system under the following conditions.
>
> 1. Kernel version is 2.6.6 or later (including 2.6.8.1 and 2.6.9-rc2).
> 2. Ext3's journalling mode is "data=journal".
> 3. Create a file (whose size is 1Mbytes) and execute umount/mount.
> 4. lseek to a random position within the file, write 8192 bytes
> data, and fsync or fdatasync.
>
> We presume the data was not written to the corresponding disk
> before returning from fsync or fdatasync syscall on the evidence
> as follows:
>
> 1. The response time of fsync() and fdatasync() was extremely
> short.
>
> We use the "diskio" tool, which is downloadable from OSDL page
> (http://developer.osdl.jp/projects/doubt/). The program showed
> that the response time was under 10 microseconds. This time
> cannot be achieved with data transfer on IDE and PCI bus!
>
> 2. The IDE writing routine ide_start_dma() was not called under
> DMA enabled.
>
> We inserted the print messages in the sys_write(), sys_fsync()
> and ide_start_dma() by the attached patch. Sometimes the
> "ide_start_dma: ..." message was not shown between "write: in
> ..." and "fsync: out ...".
>
> The problem was occurred since 2.6.5-bk1, which includes the patch
> "[PATCH] ext3 fsync() and fdatasync() speedup". We found that the
> problem was solved by deleting the part of the patch which
> modifies ext3_sync_file(). Maybe, i_state is not correctly set to
> I_DIRTY when the related page cache is dirty (is it true?)

I forgot about this one.

> Attached file is tarball (tar + bzip2) which contains following
> files. The patches are for 2.6.8.1-kernel (applicable to
> 2.6.9-rc2), and the results were also produced with
> 2.6.8.1-kernel.

We really don't need a 100k tarball to communicate a three line patch :(

Yes, the I_DIRTY test is bogus because data pages are not marked dirty at
write() time when the filesystem is mounted in data=journal mode.

However your patch will disable the above optimisation for data=writeback
and data-ordered modes as well. I don't think that's necessary?

How about this?

--- 25/fs/ext3/fsync.c~ext3-journal-data-fsync-fix Thu Sep 16 14:47:21 2004
+++ 25-akpm/fs/ext3/fsync.c Thu Sep 16 14:47:33 2004
@@ -49,10 +49,6 @@ int ext3_sync_file(struct file * file, s

J_ASSERT(ext3_journal_current_handle() == 0);

- smp_mb(); /* prepare for lockless i_state read */
- if (!(inode->i_state & I_DIRTY))
- goto out;
-
/*
* data=writeback:
* The caller's filemap_fdatawrite()/wait will sync the data.
@@ -76,6 +72,10 @@ int ext3_sync_file(struct file * file, s
goto out;
}

+ smp_mb(); /* prepare for lockless i_state read */
+ if (!(inode->i_state & I_DIRTY))
+ goto out;
+
/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
_

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/