Metadata in sys_sync_file_range and fadvise(DONTNEED)

From: Chad Talbott
Date: Fri Oct 31 2008 - 16:54:28 EST


We are looking at adding calls to posix_fadvise(DONTNEED) to various
data logging routines. This has two benefits:

- frequent write-out: queues stay shorter, which lowers latency, and
the disk is better utilized because writeout begins immediately

- less useless data in the page cache
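
A minimal sketch of what such a logging routine might look like, for
illustration only. The helper name log_append() is hypothetical and not
taken from the actual logging code; it appends a record and then advises
the kernel that the pages just written are not needed again. The advice
is treated as a hint, so its return value is ignored here.

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: append len bytes to the log open on fd, then
 * advise the kernel that the written range is not needed in the page
 * cache. Returns 0 on success, -1 if the append itself failed.
 */
static int log_append(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    /* Remember where this record starts so the advice covers only it. */
    off_t start = lseek(fd, 0, SEEK_END);
    if (start == (off_t)-1)
        return -1;

    /* write() may be short, so loop until the whole record is out. */
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0)
            return -1;
        p += n;
        left -= (size_t)n;
    }

    /*
     * Advisory only: posix_fadvise() returns an error number rather
     * than setting errno, and a failed hint does not invalidate the
     * append, so the result is deliberately ignored.
     */
    (void)posix_fadvise(fd, start, (off_t)len, POSIX_FADV_DONTNEED);
    return 0;
}
```

Note that this only covers the data pages of the file; nothing in the
call asks for the associated metadata to be written.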

One problem with posix_fadvise() (on ext2, at least) is that the
associated metadata isn't scheduled for writeout along with the data.
So, for a large log file with a high append rate, hundreds of indirect
blocks are left to be written out by periodic writeback. This metadata
consists of single blocks spaced 4MB apart, leading to spikes of very
inefficient disk utilization, deep queues and high latency.
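
The 4MB spacing can be worked out from the ext2 layout. The sketch below
assumes 4096-byte blocks (the block size is not stated above, but it is
the one that yields 4MB) and 4-byte block pointers. The function names
are made up for illustration, and the count ignores the handful of
double- and triple-indirect blocks.

```c
#include <stdint.h>

/*
 * Bytes of file data mapped by one ext2 indirect block: it holds
 * block_size / 4 pointers, each pointing at one data block.
 */
static uint64_t indirect_span_bytes(uint64_t block_size)
{
    return (block_size / 4) * block_size;
}

/*
 * Approximate number of leaf indirect blocks for a file of file_size
 * bytes. The first 12 blocks are mapped directly from the inode and
 * need no indirect block.
 */
static uint64_t approx_indirect_blocks(uint64_t file_size,
                                       uint64_t block_size)
{
    uint64_t direct_bytes = 12 * block_size;
    uint64_t span = indirect_span_bytes(block_size);

    if (file_size <= direct_bytes)
        return 0;
    /* Round up: a partially used indirect block still has to exist. */
    return (file_size - direct_bytes + span - 1) / span;
}
```

With 4096-byte blocks, indirect_span_bytes() gives 4194304 bytes (4MB),
and a 1GB log file needs roughly 256 indirect blocks, each a single
block sitting 4MB from the next.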

Andrew suggests a new SYNC_FILE_RANGE_METADATA flag for
sys_sync_file_range(), and leaving posix_fadvise() alone. That will
work for my purposes, but it seems like it leaves
posix_fadvise(DONTNEED) with a performance bug on ext2 (or any other
filesystem with interleaved data/metadata). Andrew's argument is that
people have expectations about posix_fadvise() behavior as it's been
around for years in Linux.
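
For reference, a rough sketch of how the logging code might use
sys_sync_file_range() as it exists today. The helper name
start_range_writeout() is hypothetical. SYNC_FILE_RANGE_METADATA is only
the flag proposed above and is not defined by current headers, so the
sketch uses only the existing SYNC_FILE_RANGE_WRITE flag and marks in a
comment where the new flag would presumably go.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: start asynchronous writeout of the dirty pages
 * in [offset, offset + nbytes) without waiting for completion.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int start_range_writeout(int fd, off_t offset, off_t nbytes)
{
    unsigned int flags = SYNC_FILE_RANGE_WRITE;

    /*
     * The proposed SYNC_FILE_RANGE_METADATA flag would presumably be
     * OR'ed into flags here. It is only a proposal at this point, so
     * it is left out and only the data pages are queued.
     */
    return sync_file_range(fd, offset, nbytes, flags);
}
```

As written, this has the same limitation described above: only data
pages are queued, and the indirect blocks still wait for periodic
writeback.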

I'd like to get a consensus on what The Right Thing is, so I can
implement it and move the logging code onto that interface.

Chad