Implement pageout clustering at the VM level.
With this patch the VM scanner calls pageout_cluster() instead of
->writepage(). pageout_cluster() tries to find a group of dirty pages around
the target page, called the "pivot" page of the cluster. If a group of
suitable size is found, ->writepages() is called for it; otherwise
pageout_cluster() falls back to ->writepage().
This is supposed to help workloads with significant page-out of file-system
pages from the tail of the inactive list (for example, heavy dirtying through
mmap), because file systems usually write multiple pages more efficiently
than single ones. It should also be advantageous for file systems doing
delayed allocation, as they can then allocate whole extents at once.
A few points:
- swap-cache pages are not clustered (although they could be, keyed by
page->private rather than page->index)
- currently, kswapd clusters all the time, while direct reclaim clusters only
when the device queue is not congested. Probably direct reclaim shouldn't
cluster at all.
- this patch adds new fields to struct writeback_control and expects
->writepages() to interpret them. This is needed because pageout_cluster()
calls ->writepages() with the pivot page already locked, so ->writepages()
may only trylock the other pages in the cluster.
Besides, rather rough plumbing (the wbc->pivot_ret field) is added to check
whether ->writepages() failed to write the pivot page for any reason (in the
latter case pageout_cluster() falls back to ->writepage()).
Only mpage_writepages() was updated to honor these new fields, but all
in-tree ->writepages() implementations seem to call mpage_writepages().
(Except reiser4, of course, for which I'll send a (trivial) patch if
necessary.)
Numbers that talk:
Averaged number of microseconds it takes to dirty 1GB of a
16-times-larger-than-RAM ext3 file mmapped in 1GB chunks:

without-patch: average:   74188417.156250
               deviation: 10538258.613280
with-patch:    average:   69449001.583333
               deviation: 12621756.615280
(Patch is for 2.6.10-rc2)