VFS (a_ops) change for efficient O_SYNC support?

From: Ulrich Kunitz (gefm21@uumail.de)
Date: Sun May 14 2000 - 10:12:44 EST


Andi Kleen wrote:

> It is just very very inefficient to do it this way, especially on file systems
> that don't use page sized blocks. Fsync does a lot of work while searching
> the buffer cache for dirty blocks. For page block sized it is a bit better,
> but still rather slow. Implementing O_SYNC directly is preferable I think.

If we don't want to use fsync(), we need to check that the data has been
written like in generic_buffer_fdatasync(). I don't want to do that in
a_ops->commit_write(), because large O_SYNC writes will be slowed quite
a bit.

My proposal would be to include write_one_page() and waitfor_one_page()
from filemap.c into the adress space operations, maybe as a single
function write_back_page() or as an interface change to sync_page().

At the down side we have another VFS interface change. But we get a
generic_fdatasync() that can be used to implement sys_fdatasync() the
right way. We have also kept the syncing semantics at the generic layer.

Way to go?

Here is an experimental BROKEN patch, that assumes that
generic_file_write() is only used with buffer based fs. It's some kind
of generic_buffer_file_write(). I don't want it this way.
The layering would become wrong as with generic_buffer_fdatasync()
and it would double code.

As I said the patch breaks with O_SYNC writes on NFS, so please
don't tell me. I've tested with ext2 and I write this mail using this
kernel.

diff -uNr linux-2.3.99-pre8/mm/filemap.c linux-2.3.99-pre8-uk3/mm/filemap.c
--- linux-2.3.99-pre8/mm/filemap.c Sat May 13 10:13:32 2000
+++ linux-2.3.99-pre8-uk3/mm/filemap.c Sun May 14 16:44:08 2000
@@ -39,6 +39,10 @@
  * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com>
  *
  * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de>
+ *
+ * Ulrich Kunitz <gefm21@uumail.de>, 14. 5. 2000:
+ * experimental (BROKEN) implementation of O_SYNC support
+ * works only with buffer based fs, NFS broken
  */
 
 atomic_t page_cache_size = ATOMIC_INIT(0);
@@ -2398,6 +2402,26 @@
         }
 }
 
+/* This function is used by generic_file_write to wait on pages
+ * that have been written to disk and unlock them.
+ */
+static inline int wait_on_synced_pages(struct page *pages[], int cnt)
+{
+ int err;
+ int status = 0;
+ int i;
+
+ for (i = 0; i < cnt; i++) {
+ err = waitfor_one_page(pages[i]);
+ if (err < 0 && status == 0)
+ status = err;
+ UnlockPage(pages[i]);
+ page_cache_release(pages[i]);
+ }
+
+ return status;
+}
+
 /*
  * Write to a file through the page cache.
  *
@@ -2413,6 +2437,9 @@
  * file system has to do this all by itself, unfortunately.
  * okir@monad.swb.de
  */
+
+#define NBUF 16
+
 ssize_t
 generic_file_write(struct file *file,const char *buf,size_t count,loff_t *ppos)
 {
@@ -2421,11 +2448,14 @@
         unsigned long limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
         loff_t pos;
         struct page *page, *cached_page;
+ struct page *synced_pages[NBUF];
+ int cnt_synced_pages;
         unsigned long written;
         long status;
         int err;
 
         cached_page = NULL;
+ cnt_synced_pages = 0;
 
         down(&inode->i_sem);
 
@@ -2501,12 +2531,28 @@
                 status = mapping->a_ops->commit_write(file, page, offset, offset+bytes);
                 if (!status)
                         status = bytes;
+ if (status < 0)
+ goto unlock;
 
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
+ written += status;
+ count -= status;
+ pos += status;
+ buf += status;
+
+ /* If O_SYNC is set, write page to disk */
+ if (file->f_flags & O_SYNC) {
+ status = writeout_one_page(page);
+ if (status < 0)
+ goto unlock;
+ if (cnt_synced_pages == NBUF) {
+ status = wait_on_synced_pages(
+ synced_pages, cnt_synced_pages);
+ cnt_synced_pages = 0;
+ if (status < 0)
+ break;
+ }
+ synced_pages[cnt_synced_pages++] = page;
+ continue;
                 }
 unlock:
                 /* Mark it unlocked again and drop the page.. */
@@ -2517,11 +2563,20 @@
                         break;
         }
         *ppos = pos;
+ err = wait_on_synced_pages(synced_pages, cnt_synced_pages);
+ if (err < 0 && status == 0) {
+ status = err;
+ }
 
         if (cached_page)
                 page_cache_free(cached_page);
 
+ if (file->f_flags & O_SYNC) {
+ write_inode_now(inode);
+ }
+
         err = written ? written : status;
+
 out:
         up(&inode->i_sem);
         return err;

--
Ulrich Kunitz (gefm21@uumail.de)

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Mon May 15 2000 - 21:00:24 EST