O_DIRECT

From: Andrea Arcangeli (andrea@suse.de)
Date: Thu Apr 12 2001 - 16:09:44 EST


I wrote the O_DIRECT zerocopy raw I/O support (DMA from disk straight to userspace
memory, through the filesystem). The patch against 2.4.4pre2 + rawio-3 is here:

        ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.4pre2/o_direct-1

Only ext2 is supported at the moment, but extending it to the other filesystems that
use the common helper functions is trivial (I guess Chris will take care
of reiserfs; it may be an option to skip tail packing for files opened with
O_DIRECT, so the tail can be DMA'd from userspace as well).
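
For illustration, here is a minimal userspace sketch of what using the flag looks
like (the 4096-byte alignment and posix_memalign are assumptions of the example;
the patch's actual requirement is just that the user buffer address and I/O size
are suitably aligned, which I take here to be the fs blocksize):

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const size_t blksize = 4096;		/* assumed ext2 blocksize */
	const size_t iosize = 1024 * 1024;	/* must be a blksize multiple */
	void *buf;
	int fd;

	/* both the buffer address and the I/O size must be aligned */
	if (posix_memalign(&buf, blksize, iosize))
		return 1;

	fd = open("datafile", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* the read DMAs straight from the disk into buf, no pagecache copy */
	if (read(fd, buf, iosize) < 0)
		perror("read");

	close(fd);
	free(buf);
	return 0;
}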

The above patch depends on the rawio performance improvement patch posted to
the list a few days ago (latest version here):

        ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.4pre2/rawio-3

The rawio-3 patch is suggested for integration; it's certainly saner and faster
than mainline 2.4 (a note of credit: part of the fixes in the rawio-3 patch
are merged from SCT's patch).

To benchmark the improvement given by O_DIRECT, I hacked bonnie to open the
file with O_DIRECT in the "block" tests and I changed the chunk size to 1MB (so
that the blkdev layer will send large requests to the hardware). Then I made a
comparison between the bonnie numbers w/o and w/ O_DIRECT (the -o_direct param
to bonnie now selects the O_DIRECT or standard behaviour; I also added
a -fast param to skip the slow seek test). I cut out the numbers from the tests
that aren't using O_DIRECT to make the report more readable.
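
The hack to bonnie's block test boils down to something like the sketch below
(illustrative code only, not bonnie's actual source; the helper and flag names
are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)	/* 1MB chunks so the elevator sees big requests */

/* "use_o_direct" stands in for the -o_direct parameter */
static int open_test_file(const char *name, int use_o_direct)
{
	int flags = O_CREAT | O_WRONLY | O_TRUNC;

	if (use_o_direct)
		flags |= O_DIRECT;	/* bypass the pagecache */
	return open(name, flags, 0644);
}

/* block-write loop: buf must be suitably aligned when O_DIRECT is in use */
static long long block_write(int fd, char *buf, long long total)
{
	long long done;

	for (done = 0; done < total; done += CHUNK)
		if (write(fd, buf, CHUNK) != CHUNK)
			break;
	return done;
}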

These are still preliminary results on a mid-range machine: a 2-way SMP PII 450MHz
with 128MB of RAM, on an LVM volume (the physical volume is a single IDE disk, so
not striped and all on the same hard disk), using a working set of 400MB.

without using o_direct:
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
           MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
          400 xxxx xxxx 12999 12.1 5918 10.8 xxxx xxxx 13412 12.1 xxx xxx
          400 xxxx xxxx 12960 12.3 5896 11.1 xxxx xxxx 13520 13.3 xxx xxx

with o_direct:
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
           MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
          400 xxxx xxxx 12810 1.8 5855 1.6 xxxx xxxx 13529 1.2 xxx xxx
          400 xxxx xxxx 12814 1.8 5866 1.7 xxxx xxxx 13519 1.3 xxx xxx

As you can see there's a small performance drop in writes; I guess that's
because we are somewhat more synchronous and a bit more time passes between the
completion of one I/O request and the submission of the next one, and that's ok.
(If you really care, that little difference can be covered by using async-io.)
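
To illustrate the async-io remark: one way to cover that gap is to keep a couple
of O_DIRECT writes in flight at all times, so the disk never waits for userspace
between requests. The sketch below uses the libaio io_submit interface (link with
-laio), which is an assumption of the example and not something the patch itself
provides; error paths are trimmed:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <libaio.h>

#define CHUNK	(1024 * 1024)
#define DEPTH	2		/* writes kept in flight */

static int stream_out(int fd, long long total)
{
	io_context_t ctx = 0;
	struct iocb iocb[DEPTH], *iocbp;
	struct io_event ev;
	void *buf[DEPTH];
	long long off = 0;
	int i, inflight = 0;

	if (io_setup(DEPTH, &ctx))
		return -1;
	for (i = 0; i < DEPTH; i++)
		if (posix_memalign(&buf[i], 4096, CHUNK))
			return -1;

	/* prime the queue with DEPTH outstanding writes */
	for (i = 0; i < DEPTH && off < total; i++, off += CHUNK) {
		io_prep_pwrite(&iocb[i], fd, buf[i], CHUNK, off);
		iocbp = &iocb[i];
		if (io_submit(ctx, 1, &iocbp) != 1)
			return -1;
		inflight++;
	}

	/* each time a write completes, immediately resubmit its iocb */
	while (inflight) {
		if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
			return -1;
		inflight--;
		if (off < total) {
			struct iocb *done = ev.obj;

			io_prep_pwrite(done, fd, done->u.c.buf, CHUNK, off);
			iocbp = done;
			if (io_submit(ctx, 1, &iocbp) != 1)
				return -1;
			off += CHUNK;
			inflight++;
		}
	}
	io_destroy(ctx);
	return 0;
}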

The most interesting part is the CPU load, which drops to 1-2% during all the
I/O, and this is "only" a 13MB/sec hard disk on top of LVM. I didn't run any
benchmark on a faster disk or without LVM, but O_DIRECT is the obvious way to go
for streaming endless data to disk (possibly at 100MB/sec or more) as in
multimedia or scientific apps, and of course with DBMSes that do their own
userspace management of the I/O cache in shm.

From a DBMS point of view the only downsides of O_DIRECT compared to the
rawio device are: 1) the walking of the metadata in the fs [but that is in turn
_the_ feature that gives more flexibility to the administrator] and 2) O_DIRECT
cannot be done on a shared disk without also using a filesystem like GFS,
because a regular fs doesn't know how to keep the metadata coherent across
multiple hosts.

Programming-wise, the coherency of the cache under the direct I/O is the only
non-obvious issue. What I did is simply to flush the data (none of the
metadata!) before starting the direct I/O, and after any direct write to
discard all the unmapped pagecache of the inode, or to invalidate it (clear
all the dirty bits in the pagecache and on the overlapped buffers if the page
was mapped, so the next read through the cache will hit the disk again). This
seems safe even if it somehow breaks the semantics of mmaps (for example if the
file is mmapped during the O_DIRECT I/O the mmap view won't be updated), but
O_DIRECT is magic anyway in the way it requires alignment of the address and
size of the user buffer, so this doesn't seem to be a problem... and not keeping
perfect coherency in the not useful cases increases the performance of the
useful cases. I didn't see a value in updating the cache; if you want to update
the cache in a mapping, simply don't use O_DIRECT ;).
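
In pseudo-kernel C, the per-request flow described above is roughly the following
(a conceptual sketch only: the helpers marked "placeholder" are illustrative names,
not the functions in the actual o_direct-1 patch):

/*
 * Conceptual sketch of the coherency protocol around a direct I/O request.
 */
static ssize_t o_direct_rw(struct file *filp, char *ubuf, size_t size,
			   loff_t off, int rw)
{
	struct inode *inode = filp->f_dentry->d_inode;
	ssize_t ret;

	/* 1) flush the dirty *data* in the pagecache (no metadata) so the
	 *    direct I/O sees what was previously written through the cache */
	flush_dirty_data(inode);				/* placeholder */

	/* 2) walk the fs metadata and DMA directly between the disk and
	 *    the (aligned) user buffer */
	ret = do_direct_blocks(rw, inode, ubuf, size, off);	/* placeholder */

	/* 3) after a direct write, drop the unmapped pagecache and clear the
	 *    dirty bits on whatever can't be dropped, so buffered readers go
	 *    back to the disk */
	if (rw == WRITE)
		invalidate_inode_pages2(inode->i_mapping);

	return ret;
}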

To invalidate the cache I couldn't use invalidate_inode_pages or
truncate_inode_pages: the former was too weak (I cannot simply skip the
stuff that isn't clean and unmapped, or the user would never see
the updates afterwards when he stops using O_DIRECT), and the latter was too
aggressive (it can be used only once the inode is being released and we know
there are no active mmaps on the file; set_page_dirty was oopsing on me
because page->mapping becomes null in a shared mapping, for example). So I
wrote a third one, invalidate_inode_pages2, that also seems attractive from an
NFS point of view, though I didn't change nfs to use it. If we change the VM to
be robust about mapped pages getting removed from the pagecache I could use
truncate_inode_pages() in the future.
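
The per-page policy described above amounts to something like this (again a
conceptual sketch with placeholder helper names, omitting the list walking and
locking; this is not the patch's literal code):

/*
 * Conceptual per-page policy of invalidate_inode_pages2.
 */
static void invalidate_one_page(struct page *page)
{
	if (page_is_droppable(page)) {	/* placeholder: unmapped, no extra refs */
		/* safe to drop: remove it from the pagecache */
		drop_from_pagecache(page);		/* placeholder */
	} else {
		/* mapped (or otherwise busy): can't drop it, so clear the
		 * dirty bits on the page and its buffers, so the next read
		 * through the cache hits the disk again */
		ClearPageDirty(page);
		clear_page_buffer_dirty(page);		/* placeholder */
	}
}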

One important detail is that invalidate_inode_pages2 doesn't take a "range" of
addresses in the address_space; it invalidates the whole address_space instead.
At first I was invalidating only the modified range, and after the putc test
(which loads the cache) the rewrite test was overloading the cpu (around 30% of
cpu usage). After I started dropping the whole address space, the cpu load
during the rewrite test returned to the earlier numbers, from before O_DIRECT
had any cache coherency. (I was also browsing the list in a less than optimal
order, but I preferred to drop the range option rather than force people to
always read physically consecutively in order to avoid the quadratic behaviour :)

As mentioned above, if somebody keeps mmaps on the file, invalidate_inode_pages2
won't be able to drop the pages from the list and it will waste some CPU in
kernel space. The only thing I did about that is to put reschedule points in the
right places, so the machine will remain perfectly responsive if somebody tries
to exploit it.

Some other explanation on the i_dirty_data_buffers list (you are probably
wondering why I added it): in the early stage I didn't flush the cache before
starting the direct I/O and I was getting performance like in the final
patch now. When the thing started working and I started to care about having
something that works with mixed O_DIRECT and non-direct I/O, I added a
generic_osync_inode (a la generic_file_write) to flush the cache to disk before
starting the direct I/O. This degraded the performance significantly, because
generic_osync_inode goes down and writes all the metadata synchronously as
well. More precisely, I was getting these numbers:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
           MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
          400 xxxx xxxx 11138 1.7 5880 3.2 xxxx xxxx 13519 1.7 xxxx xxxx
                         ^^^^^ ^^^
(I am showing only one line, but the numbers were very stable and it was a
really noticeable slowdown)

To resurrect the performance of the original patch I had to split the inode buffer
list in two, one for the data and one for the metadata, and now I only flush and
synchronously wait for I/O completion of the i_dirty_data_buffers list before
starting the direct I/O; this fixed the performance problem completely.

If O_SYNC is used in combination with O_DIRECT, I in turn skip the
i_dirty_data_buffers list and only flush the metadata (the i_dirty_buffers list)
and the inode if there was dirty metadata (this is why I introduced a third case
in generic_osync_inode; in practice this probably doesn't make any difference though).
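
As a sketch of what that third case amounts to (placeholder helper names again,
not the literal generic_osync_inode code): with O_DIRECT the data has already hit
the disk, so the O_SYNC pass only has to care about the metadata buffers and the
inode itself.

static void o_direct_osync(struct inode *inode)
{
	/* skip inode->i_dirty_data_buffers: the direct write already went
	 * straight to disk */
	sync_buffer_list(&inode->i_dirty_buffers);	/* placeholder: metadata */
	if (inode_is_dirty(inode))			/* placeholder */
		write_inode_now_sync(inode);		/* placeholder */
}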

Another change I made in brw_kiovec is that in the block[] array the block
number -1UL is now reserved: during reads from disk it indicates that the
destination memory should be cleared (it's used to handle holes in the files).
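
In the read path that means roughly the following (illustrative sketch, not the
literal diff):

static void kiovec_read_blocks(char *dest, unsigned long *blocks,
			       int nr_blocks, int blocksize)
{
	int i;

	for (i = 0; i < nr_blocks; i++) {
		if (blocks[i] == -1UL) {
			/* hole: nothing on disk to read, zero-fill instead */
			memset(dest + i * blocksize, 0, blocksize);
			continue;
		}
		/* real block: map a buffer_head to it and submit the read */
		submit_block_read(dest + i * blocksize, blocks[i]); /* placeholder */
	}
}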

Andrea


