Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3preallocation

From: Mingming Cao
Date: Fri Apr 02 2004 - 20:43:43 EST


Hi Andrew,
Here is the second version of the ext3, mostly bug fixes and made the
changes you have suggested last time.

It's a stable version now. We have done overnight fsx tests on a 2 CPU
PIII 700MHz box, did not see any issues. We also run tiobench and untar
tests.

> - Please use u32 for block numbers everywhere.
I tried to make the changes as you suggested: use u32 for block numbers
everywhere. Then hit a problem: sometimes, especially when doing block
allocation or find a free block on bitmap, on success, it returns the
block number, on fail, it returns -1 to indicate the failure. I found
we can't use the u32 for those cases. ext3_new_block() and
ext3_free_block() did the same thing. So I just did wherever I could to
make a block number from int to u32, and keep it int if it also means
failure in the fail case. Hmm...Is there any good way to do this?

> You should search both bitmaps to find a block
> which really is allocatable. Otherwise you'll have
> ext3_try_to_allocate() failing 20,000 times in succession and much CPU
> will be burnt.
Done.

> - I suspect ext3_try_to_allocate_with_rsv() could be reorganised a bit to
> reduce the goto spaghetti?
Re-organized. Merged conditions together and put them into a single loop. Now it does not contains any goto :-) See if the code looks clean and now...

> - Please provide a mount option which enables the feature, defaulting to
> "off".
Planning to do right after this version. Besides enable/disable the
reservation feature, I am thinking to enable the feature that could set
the the default reservation window size(in blocks) when the fs is
mounted. just one single mount option:"prealloc_window=n". When n=0,
it means turns off, when n>0, it means on, and the ext3 default
reservation window size for each file is n blocks(or 8 blocks, if 0< n <
8).

> - Make sure that you have a many-small-file test. Say, untar a kernel
> tree onto a clean filesystem and make sure that reading all the files in
> the tree is nice and fast.
Yes, have done this: untar linux-2.6.4 kernel tree to a clean ext3
filesystem. Verified the reservation is being discarded and the small
files are contiguous on-disk.

> Apart from that, looking good. Where are the benchmarks? ;)
One simple dd test, it's 4 times faster with the patch on a 2 way PIII
700Mhz box:

# cat /tmp/dd-large.sh
dd if=/dev/zero of=x1 bs=4k count=5000 &
dd if=/dev/zero of=x2 bs=4k count=5000 &
dd if=/dev/zero of=x3 bs=4k count=5000 &
dd if=/dev/zero of=x4 bs=4k count=5000 &
dd if=/dev/zero of=x5 bs=4k count=5000 &
dd if=/dev/zero of=x6 bs=4k count=5000 &
dd if=/dev/zero of=x7 bs=4k count=5000 &
dd if=/dev/zero of=x8 bs=4k count=5000 &
elm3b92:/mnt # time /tmp/dd-large.sh

# time /tmp/dd-large.sh
real 0m0.431s
user 0m0.001s
sys 0m0.009s

linux-2.6.4 kernel + ext3 reservation patch, window size 128 blocks:
# time /tmp/dd-large.sh
real 0m0.098s
user 0m0.001s
sys 0m0.009s

We also did tiobench sequential write tests on ext2, jfs and ext3 on
different number of threads(1,4,8,16,32 and 64), block size is 4k, file
size is 4000k. For ext3, we tried different reservation size from 8 to
128 blocks, as well as without any reservation. The test was done on a
8CPU PIII 700 i386 machine with 4G memory, on linux-2.6.4 kernel+patch
and linux2.6.4-mm1 kernel.

Attached is a graphic file show the tiobench results. It shows that
before the patch, the sequential write throughput on ext3 is pretty bad
compare with ext2 and jfs; with the patch, it's much better. And the
block allocation on disk for the files created is much less fragmented.

Planning to do dbench and other regression test later. Just what to
share with you the current status.

Patch attached below.

Thanks!
Mingming

fs/ext3/balloc.c | 581
+++++++++++++++++++++++++++++++++++++++++----
fs/ext3/file.c | 3
fs/ext3/ialloc.c | 6
fs/ext3/inode.c | 14 -
fs/ext3/super.c | 7
include/linux/ext3_fs.h | 3
include/linux/ext3_fs_i.h | 12
include/linux/ext3_fs_sb.h | 4
8 files changed, 578 insertions(+), 52 deletions(-)

diff -urNp linux-2.6.4/fs/ext3/balloc.c 264-rsv/fs/ext3/balloc.c
--- linux-2.6.4/fs/ext3/balloc.c 2004-03-10 18:55:21.000000000 -0800
+++ 264-rsv/fs/ext3/balloc.c 2004-04-02 02:33:00.327572528 -0800
@@ -96,6 +96,96 @@ read_block_bitmap(struct super_block *sb
error_out:
return bh;
}
+#define EXT3_RESERVATION_DEBUG2
+/*
+ * The reservation window structure operations
+ * --------------------------------------------
+ * Operations include:
+ * dump, find, add, remove, is_empty, find_next_reservable_window, etc.
+ *
+ * We use sorted double linked list for the per-filesystem reservation
+ * window list. (like in vm_region).
+ *
+ * Initially, we keep those small operations in the abstract functions,
+ * so later if we need a better searching tree than double linked-list,
+ * we could easily switch to that without changing too much
+ * code.
+ */
+static inline void rsv_window_dump(struct reserve_window *head, char *fn)
+{
+ struct reserve_window *rsv;
+ printk("Block Allocation Reservation Windows Map (%s):\n", fn);
+ list_for_each_entry(rsv, &head->rsv_list, rsv_list) {
+ printk("reservation window 0x%p start: %d, end: %d\n",
+ rsv, rsv->rsv_start, rsv->rsv_end);
+ }
+}
+
+static inline int goal_in_my_reservation(struct reserve_window *rsv, unsigned long goal)
+{
+ return ((goal >= rsv->rsv_start) && (goal <= rsv->rsv_end));
+}
+
+/*
+ * find if the given block is within any reservation on the list
+ */
+static inline int rsv_window_find(struct reserve_window *start,
+ unsigned long block, unsigned long last_block)
+{
+ struct reserve_window *rsv;
+
+ list_for_each_entry(rsv, &start->rsv_list, rsv_list) {
+ if (goal_in_my_reservation(rsv, block))
+ return 1; /* found it*/
+ if (rsv->rsv_start > last_block)
+ return 0;
+ }
+ return 0;
+}
+
+static inline void rsv_window_add(struct reserve_window *rsv,
+ struct reserve_window *prev)
+{
+ /* insert the new reservation window after the head */
+ list_add(&rsv->rsv_list, &prev->rsv_list);
+}
+
+static inline void rsv_window_remove(struct reserve_window *rsv)
+{
+#ifdef EXT3_RESERVATION_DEBUG
+ printk(KERN_DEBUG "Reservation Window is removed:"
+ " 0x%p start: %d, end: %d \n", rsv, rsv->rsv_start,
+ rsv->rsv_end);
+#endif
+ rsv->rsv_start = 0;
+ rsv->rsv_end = 0;
+ list_del(&rsv->rsv_list);
+ INIT_LIST_HEAD(&rsv->rsv_list);
+}
+static inline int rsv_is_empty(struct reserve_window *rsv)
+{
+ /* a valid reservation end block could not be 0 */
+ return (rsv->rsv_end == 0);
+}
+
+void ext3_discard_reservation(struct inode *inode)
+{
+ struct ext3_inode_info *ei = EXT3_I(inode);
+ struct reserve_window *rsv = &ei->i_rsv_window;
+ spinlock_t *rsv_lock = &EXT3_SB(inode->i_sb)->s_rsv_window_lock;
+
+#ifdef EXT3_RESERVATION_DEBUG
+ printk(KERN_DEBUG "ext3_discard_reservation:"
+ " 0x%p start: %d, end: %d \n", rsv, rsv->rsv_start,
+ rsv->rsv_end);
+#endif
+
+ if (!rsv_is_empty(rsv)) {
+ spin_lock(rsv_lock);
+ rsv_window_remove(rsv);
+ spin_unlock(rsv_lock);
+ }
+}

/* Free given blocks, update quota and i_blocks field */
void ext3_free_blocks (handle_t *handle, struct inode * inode,
@@ -313,6 +403,33 @@ static inline int ext3_test_allocatable(
return ret;
}

+static inline int
+bitmap_search_next_usable_block(unsigned long start, struct buffer_head *bh,
+ unsigned long maxblocks)
+{
+ unsigned long next;
+ struct journal_head *jh = bh2jh(bh);
+
+ /*
+ * The bitmap search --- search forward alternately through the actual
+ * bitmap and the last-committed copy until we find a bit free in
+ * both
+ */
+ while (start < maxblocks) {
+ next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
+ if (next >= maxblocks)
+ return -1;
+ if (ext3_test_allocatable(next, bh))
+ return next;
+ jbd_lock_bh_state(bh);
+ if (jh->b_committed_data)
+ start = ext3_find_next_zero_bit(jh->b_committed_data,
+ maxblocks, next);
+ jbd_unlock_bh_state(bh);
+ }
+ return -1;
+}
+
/*
* Find an allocatable block in a bitmap. We honour both the bitmap and
* its last-committed copy (if that exists), and perform the "most
@@ -325,7 +442,6 @@ find_next_usable_block(int start, struct
{
int here, next;
char *p, *r;
- struct journal_head *jh = bh2jh(bh);

if (start > 0) {
/*
@@ -359,19 +475,8 @@ find_next_usable_block(int start, struct
* bitmap and the last-committed copy until we find a bit free in
* both
*/
- while (here < maxblocks) {
- next = ext3_find_next_zero_bit(bh->b_data, maxblocks, here);
- if (next >= maxblocks)
- return -1;
- if (ext3_test_allocatable(next, bh))
- return next;
- jbd_lock_bh_state(bh);
- if (jh->b_committed_data)
- here = ext3_find_next_zero_bit(jh->b_committed_data,
- maxblocks, next);
- jbd_unlock_bh_state(bh);
- }
- return -1;
+ here = bitmap_search_next_usable_block(here, bh, maxblocks);
+ return here;
}

/*
@@ -407,62 +512,445 @@ claim_block(spinlock_t *lock, int block,
*/
static int
ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group,
- struct buffer_head *bitmap_bh, int goal, int *errp)
+ struct buffer_head *bitmap_bh, int goal, struct reserve_window * my_rsv)
{
- int i;
- int fatal;
- int credits = 0;
-
- *errp = 0;
+ unsigned long group_first_block, start, end;

- /*
- * Make sure we use undo access for the bitmap, because it is critical
- * that we do the frozen_data COW on bitmap buffers in all cases even
- * if the buffer is in BJ_Forget state in the committing transaction.
- */
- BUFFER_TRACE(bitmap_bh, "get undo access for new block");
- fatal = ext3_journal_get_undo_access(handle, bitmap_bh, &credits);
- if (fatal) {
- *errp = fatal;
- goto fail;
+ /* if EXT3_RESERVATION */
+ if (my_rsv) {
+ group_first_block =
+ le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+ if (my_rsv->rsv_start >= group_first_block)
+ start = my_rsv->rsv_start - group_first_block;
+ else
+ /* reservation window cross group boundary */
+ start = group_first_block;
+ end = my_rsv->rsv_end - group_first_block;
+ if (end > EXT3_BLOCKS_PER_GROUP(sb))
+ /* reservation window cross group boundary */
+ end = EXT3_BLOCKS_PER_GROUP(sb);
+ }
+ else {
+ if (goal > 0)
+ start = goal;
+ else
+ start = 0;
+ end = EXT3_BLOCKS_PER_GROUP(sb);
}

repeat:
if (goal < 0 || !ext3_test_allocatable(goal, bitmap_bh)) {
- goal = find_next_usable_block(goal, bitmap_bh,
- EXT3_BLOCKS_PER_GROUP(sb));
+ goal = find_next_usable_block(start, bitmap_bh, end);
if (goal < 0)
goto fail_access;

- for (i = 0; i < 7 && goal > 0 &&
+ /*for (i = 0; i < 7 && goal > 0 &&
ext3_test_allocatable(goal - 1, bitmap_bh);
i++, goal--);
+ */
}
+ start = goal;

if (!claim_block(sb_bgl_lock(EXT3_SB(sb), group), goal, bitmap_bh)) {
/*
* The block was allocated by another thread, or it was
* allocated and then freed by another thread
*/
- goal++;
- if (goal >= EXT3_BLOCKS_PER_GROUP(sb))
+ start++;
+ if (start >= end)
goto fail_access;
goto repeat;
}

- BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for bitmap block");
- fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
+ return goal;
+fail_access:
+ return -1;
+}
+/**
+ * find_next_reservable_window():
+ * find a reservable space within the given range
+ * It does not allocate the reservation window for now
+ * alloc_new_reservation() will do the work later.
+ *
+ * @search_head: the head of the searching list;
+ * This is not necessary the list head of the whole filesystem
+ *
+ * we have both head and start_block to assist the search
+ * for the reservable space. The list start from head,
+ * but we will shift to the place where start_block is,
+ * then start from there, we looking for a resevable space.
+ *
+ * @fs_rsv_head: per-filesystem reervation list head.
+ *
+ * @size: the target new reservation window size
+ * @group_first_block: the first block we consider to start
+ * the real search from
+ *
+ * @last_block:
+ * the maxium block number that our goal reservable space
+ * could start from. This is normally the last block in this
+ * group. The search will end when we found the start of next
+ * possiblereservable space is out of this boundary.
+ * This could handle the cross bounday reservation window request.
+ *
+ * basically we search from the given range, rather than the whole
+ * reservation double linked list, (start_block, last_block)
+ * to find a free region that of of my size and has not
+ * been reserved.
+ *
+ * on succeed, it returns the reservation window to be append to.
+ * failed, return NULL.
+ */
+static inline
+struct reserve_window* find_next_reservable_window(
+ struct reserve_window *search_head,
+ struct reserve_window *fs_rsv_head,
+ unsigned short size, unsigned long *start_block,
+ unsigned long last_block)
+{
+ struct reserve_window *rsv;
+ unsigned long cur;
+
+ /* TODO:make the start of the reservation window byte alligned */
+ /*cur = *start_block & 8;*/
+ cur = *start_block;
+ rsv = list_entry(search_head->rsv_list.next, struct reserve_window, rsv_list);
+ while (rsv != fs_rsv_head) {
+ if (cur + size <= rsv->rsv_start) {
+ /*
+ * found a reserable space big enough
+ * we could have a reservation cross
+ * the group boundary here
+ */
+ break;
+ }
+ if (cur <= rsv->rsv_end)
+ cur = rsv->rsv_end + 1;
+
+ /* TODO?
+ * in the case we could not find a reservable space
+ * that is what is expected, during the research, we could
+ * remember what's the largest reservable space we could have
+ * and return that on.
+ *
+ * for now it will fail if we could not find the reservable
+ * space with expected-size (or more)...
+ */
+ rsv = list_entry(rsv->rsv_list.next, struct reserve_window, rsv_list);
+ if (cur > last_block)
+ return NULL; /* fail */
+ }
+ /*
+ * we come here either :
+ * when we rearch to the end of the whole list,
+ * and there is empty reservable space after last entry in the list.
+ * append it to the end of the list.
+ *
+ * or we found one reservable space in the middle of the list,
+ * return the reservation window that we could append to.
+ * succeed.
+ */
+ *start_block = cur;
+ return list_entry(rsv->rsv_list.prev, struct reserve_window, rsv_list);
+}
+
+/**
+ * alloc_new_reservation()--allocate a new reservation window
+ * if there is an existing reservation, discard it first
+ * then allocate the new one from there
+ * otherwise allocate the new reservation from the given
+ * start block, or the beginning of the group, if a goal
+ * is not given.
+ *
+ * To make a new reservation, we search part of the filesystem
+ * reservation list(the list that inside the group).
+ *
+ * If we have a old reservation, the search goal is the end of
+ * last reservation. If we do not have a old reservatio, then we
+ * start from a given goal, or the first block of the group, if
+ * the goal is not given.
+ *
+ * We first find a reservable space after the goal, then from
+ * there,we check the bitmap for the first free block after
+ * it. If there is no free block until the end of group, then the
+ * whole group is full, we failed. Otherwise, check if the free block
+ * is inside the expected reservable space, if so, we succeed.
+ * If the first free block is outside the reseravle space, then
+ * start from the first free block, we search for next avalibale
+ * space, and go on.
+ *
+ * on succeed, a new reservation will be found and inserted into the list
+ * It contains at least one free block, and it is not overlap with other
+ * reservation window.
+ *
+ * failed: we failed to found a reservation window in this group
+ *
+ * @rsv: the reservation
+ *
+ * @goal: The goal. It is where the search for a
+ * free reservable space should start from.
+ * if we have a old reservation, start_block is the end of
+ * old reservation. Otherwise,
+ * if we have a goal(goal >0 ), then start from there,
+ * no goal(goal = -1), we start from the first block
+ * of the group.
+ *
+ * @sb: the super block
+ * @group: the group we are trying to do allocate in
+ * @bitmap_bh: the block group block bitmap
+ */
+static int alloc_new_reservation(struct reserve_window *my_rsv,
+ int goal, struct super_block *sb,
+ unsigned int group, struct buffer_head *bitmap_bh)
+{
+ struct reserve_window *search_head;
+ unsigned long group_first_block, group_end_block, start_block;
+ int first_free_block;
+ int reservable_space_start;
+ struct reserve_window *prev_rsv;
+ struct reserve_window *fs_rsv_head = &EXT3_SB(sb)->s_rsv_window_head;
+ unsigned short size;
+
+ group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+ group_end_block = group_first_block + EXT3_BLOCKS_PER_GROUP(sb) - 1;
+
+ if (goal < 0)
+ start_block = group_first_block;
+ else
+ start_block = goal + group_first_block;
+
+ /* if we have a old reservation, discard it first */
+ if (!rsv_is_empty(my_rsv)) {
+ /*
+ * if the old reservation is cross group boundary
+ * we will come here when we just failed to allocate from
+ * the first part of the window. We still have another part
+ * that belongs to the next group. In this case, there is no
+ * point to discard our window and try to allocate a new one
+ * in this group(which will fail). we should
+ * keep the reservation window, just simply move on.
+ *
+ * Maybe we could shift the start block of the reservation window
+ * to the first block of next group...
+ */
+ if (my_rsv->rsv_end >= group_end_block)
+ return -1;
+
+ /* remember where we are before we discard the old one */
+ if (my_rsv->rsv_end + 1 > start_block)
+ start_block = my_rsv->rsv_end + 1;
+ search_head = list_entry(my_rsv->rsv_list.prev,
+ struct reserve_window, rsv_list);
+
+#ifdef EXT3_RESERVATION_DEBUG
+ printk(KERN_DEBUG "Reservation Window is removed from alloc_new_"
+ " 0x%p start: %d, end: %d \n", my_rsv, my_rsv->rsv_start,
+ my_rsv->rsv_end);
+#endif
+ rsv_window_remove(my_rsv);
+ }
+ else {
+ /*
+ * we don't have a reservation,
+ * we set our goal(start_block) and
+ * the list head for the search
+ */
+ search_head = fs_rsv_head;
+ }
+
+ /*
+ * find_next_reservable_window() simply find a reservable window
+ * inside the given range(start_block, group_end_block).
+ *
+ * To make sure the reservation window has a free bit inside it, we need
+ * to check the bitmap after we found a reservable window.
+ */
+ size = my_rsv->rsv_goal_size;
+retry:
+ prev_rsv = find_next_reservable_window(search_head, fs_rsv_head, size,
+ &start_block, group_end_block);
+ if (prev_rsv == NULL){
+#ifdef EXT3_RESERVATION_DEBUG
+ printk(KERN_DEBUG "NO WINDOW. start_block %d, group_end_block %d,"
+ "search_head %p, fs_rsv_head %p\n",
+ start_block, group_end_block, search_head,
+ fs_rsv_head);
+#endif
+ goto failed;
+ }
+ reservable_space_start = start_block;
+ /*
+ * on succeed, find_next_reservable_window() returns the
+ * reservation window where there is a reservable space after it.
+ * Before we reserve this reservable space, we need
+ * to make sure there is at least a free block inside this region.
+ *
+ * searching the first free bit on the block bitmap and copy of
+ * last committed bitmap alternatively, until we found a allocatable
+ * block. Search start from the start block of the reservable space
+ * we just found.
+ */
+ first_free_block = bitmap_search_next_usable_block(
+ reservable_space_start - group_first_block,
+ bitmap_bh, group_end_block - group_first_block + 1);
+
+ if (first_free_block < 0) {
+#ifdef EXT3_RESERVATION_DEBUG
+ printk(KERN_DEBUG "group_first_block %d, group_end_block %d,"
+ " reservable_space_start %d\n", group_first_block,
+ group_end_block, reservable_space_start);
+#endif
+ /*
+ * no free block left on the bitmap, no point
+ * to reserve the space. return failed.
+ */
+ goto failed;
+ }
+ start_block = first_free_block + group_first_block;
+ /*
+ * check if the first free block is within the
+ * free space we just found
+ */
+ if ((start_block >= reservable_space_start) &&
+ (start_block < reservable_space_start + size))
+ goto found_rsv_window;
+ /*
+ * if the first free bit we found is out of the reservable space
+ * this means there is no free block on the reservable space
+ * we should continue search for next reservable space,
+ * start from where the free block is,
+ * we also shift the list head to where we stopped last time
+ */
+ search_head = prev_rsv;
+ goto retry;
+
+found_rsv_window:
+ /*
+ * great! the reservable space contains some free blocks.
+ * Insert it to the list.
+ */
+ rsv_window_add(my_rsv, prev_rsv);
+ my_rsv->rsv_start = reservable_space_start;
+ my_rsv->rsv_end = my_rsv->rsv_start + size - 1;
+#ifdef EXT3_RESERVATION_DEBUG
+ printk(KERN_DEBUG "New reservation window allocated:"
+ " 0x%p start: %d, end: %d \n", my_rsv, my_rsv->rsv_start,
+ my_rsv->rsv_end);
+ rsv_window_dump(fs_rsv_head, "alloc_new_reservation");
+#endif
+
+ return 0; /* succeed */
+failed:
+ return -1; /* failed */
+}
+/*
+ * This is the main function used to allocate a new block and
+ * it's reservation window.
+ * each time when a new block allocation is need, first try to allocate
+ * from it's own reservation.
+ * If it does not have a reservation window, instead of looking
+ * for a free bit on bitmap first, then look up the reservation list to see if
+ * it is inside somebody else's reservation window,
+ * we try to allocate a reservation window for it start from the goal first.
+ * Then do the block allocation within the reservation window.
+ *
+ * This will aviod keep searching the reservation list again and again
+ * when someboday is looking for a free block(without reservation),
+ * and there are lots of free blocks, but they are all being reserved
+ *
+ * We use a sorted double linked list for the per-filesystem reservation list.
+ * The insert, remove and find a free space(non-reserved) operations for the
+ * sorted double linked list should be fast.
+ *
+ */
+static int
+ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
+ unsigned int group, struct buffer_head *bitmap_bh,
+ int goal, struct reserve_window * my_rsv,
+ int *errp)
+{
+ spinlock_t *rsv_lock;
+ unsigned long group_first_block;
+ int ret = 0;
+ int fatal;
+ int credits = 0;
+
+ *errp = 0;
+
+ /*
+ * Make sure we use undo access for the bitmap, because it is critical
+ * that we do the frozen_data COW on bitmap buffers in all cases even
+ * if the buffer is in BJ_Forget state in the committing transaction.
+ */
+ BUFFER_TRACE(bitmap_bh, "get undo access for new block");
+ fatal = ext3_journal_get_undo_access(handle, bitmap_bh, &credits);
if (fatal) {
*errp = fatal;
- goto fail;
+ return -1;
}
- return goal;

-fail_access:
+#ifdef EXT3_RESERVATION
+ rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
+ /*
+ * goal is a group relative block number (if there is a goal)
+ * 0 < goal < EXT3_BLOCKS_PER_GROUP(sb)
+ * first block is a filesystem wide block number
+ * first block is the block number of the first block in this group
+ */
+ group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
+ group * EXT3_BLOCKS_PER_GROUP(sb);
+
+ /*
+ * Basically we will allocate a new block from inode's reservation window.
+ *
+ * We need to allocate a new reservation window, if:
+ * a) inode does not have a reservation window; or
+ * b) last attemp of allocating a block from existing reservation failed; or
+ * c) we come here with a goal and with a reservation window
+ *
+ * we do not need to allocate a new reservation window if
+ * we come here at the beginning with a goal and the goal is inside the window, or
+ * or we don't have a goal but already have a reservation window.
+ * then we could go to allocate from the reservation window directly.
+ */
+ while (1) {
+ if (rsv_is_empty(my_rsv) || (ret < 0) ||
+ ((goal >= 0) && !goal_in_my_reservation(my_rsv, goal+group_first_block))) {
+ spin_lock(rsv_lock);
+ ret = alloc_new_reservation(my_rsv, goal, sb, group, bitmap_bh);
+ spin_unlock(rsv_lock);
+ if (ret < 0)
+ break; /* failed */
+
+ if ((goal > 0) && !goal_in_my_reservation(my_rsv, goal+group_first_block))
+ goal = -1;
+ }
+
+ ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
+ my_rsv);
+ if (ret >= 0)
+ break; /* succeed */
+ }
+#else
+ ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL);
+#endif
+
+ if (ret >= 0) {
+ BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for bitmap block");
+ fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
+ if (fatal) {
+ *errp = fatal;
+ return -1;
+ }
+ return ret;
+ }
+ return ret;
+
BUFFER_TRACE(bitmap_bh, "journal_release_buffer");
ext3_journal_release_buffer(handle, bitmap_bh, credits);
-fail:
- return -1;
+ return ret;
}

/*
@@ -490,6 +978,7 @@ ext3_new_block(handle_t *handle, struct
struct ext3_group_desc *gdp;
struct ext3_super_block *es;
struct ext3_sb_info *sbi;
+ struct reserve_window *my_rsv = &EXT3_I(inode)->i_rsv_window;
#ifdef EXT3FS_DEBUG
static int goal_hits, goal_attempts;
#endif
@@ -540,8 +1029,8 @@ ext3_new_block(handle_t *handle, struct
bitmap_bh = read_block_bitmap(sb, group_no);
if (!bitmap_bh)
goto io_error;
- ret_block = ext3_try_to_allocate(sb, handle, group_no,
- bitmap_bh, ret_block, &fatal);
+ ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
+ bitmap_bh, ret_block, my_rsv, &fatal);
if (fatal)
goto out;
if (ret_block >= 0)
@@ -569,8 +1058,8 @@ ext3_new_block(handle_t *handle, struct
bitmap_bh = read_block_bitmap(sb, group_no);
if (!bitmap_bh)
goto io_error;
- ret_block = ext3_try_to_allocate(sb, handle, group_no,
- bitmap_bh, -1, &fatal);
+ ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
+ bitmap_bh, -1, my_rsv, &fatal);
if (fatal)
goto out;
if (ret_block >= 0)
diff -urNp linux-2.6.4/fs/ext3/file.c 264-rsv/fs/ext3/file.c
--- linux-2.6.4/fs/ext3/file.c 2004-03-10 18:55:49.000000000 -0800
+++ 264-rsv/fs/ext3/file.c 2004-03-30 18:21:30.463306312 -0800
@@ -34,7 +34,8 @@
static int ext3_release_file (struct inode * inode, struct file * filp)
{
if (filp->f_mode & FMODE_WRITE)
- ext3_discard_prealloc (inode);
+ ext3_discard_reservation(inode);
+ /*ext3_discard_prealloc (inode);*/
if (is_dx(inode) && filp->private_data)
ext3_htree_free_dir_info(filp->private_data);

diff -urNp linux-2.6.4/fs/ext3/ialloc.c 264-rsv/fs/ext3/ialloc.c
--- linux-2.6.4/fs/ext3/ialloc.c 2004-03-10 18:55:27.000000000 -0800
+++ 264-rsv/fs/ext3/ialloc.c 2004-04-02 22:26:48.615417624 -0800
@@ -29,6 +29,7 @@
#include "xattr.h"
#include "acl.h"

+
/*
* ialloc.c contains the inodes allocation and deallocation routines
*/
@@ -585,6 +586,11 @@ got:
ei->i_prealloc_block = 0;
ei->i_prealloc_count = 0;
#endif
+ ei->i_rsv_window.rsv_start = 0;
+ ei->i_rsv_window.rsv_end= 0;
+ ei->i_rsv_window.rsv_goal_size = EXT3_DEFAULT_RESERVE_BLOCKS;
+ INIT_LIST_HEAD(&ei->i_rsv_window.rsv_list);
+
ei->i_block_group = group;

ext3_set_inode_flags(inode);
diff -urNp linux-2.6.4/fs/ext3/inode.c 264-rsv/fs/ext3/inode.c
--- linux-2.6.4/fs/ext3/inode.c 2004-03-10 18:55:35.000000000 -0800
+++ 264-rsv/fs/ext3/inode.c 2004-04-02 22:26:58.250952800 -0800
@@ -186,7 +186,9 @@ static int ext3_journal_test_restart(han
void ext3_put_inode(struct inode *inode)
{
if (!is_bad_inode(inode))
- ext3_discard_prealloc(inode);
+
+ ext3_discard_reservation(inode);
+ /* ext3_discard_prealloc(inode); */
}

/*
@@ -2137,8 +2139,9 @@ void ext3_truncate(struct inode * inode)
return;
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
return;
-
- ext3_discard_prealloc(inode);
+
+ ext3_discard_reservation(inode);
+ /* ext3_discard_prealloc(inode); */

/*
* We have to lock the EOF page here, because lock_page() nests
@@ -2535,7 +2538,10 @@ void ext3_read_inode(struct inode * inod
ei->i_prealloc_count = 0;
#endif
ei->i_block_group = iloc.block_group;
-
+ ei->i_rsv_window.rsv_start = 0;
+ ei->i_rsv_window.rsv_end= 0;
+ ei->i_rsv_window.rsv_goal_size = EXT3_DEFAULT_RESERVE_BLOCKS;
+ INIT_LIST_HEAD(&ei->i_rsv_window.rsv_list);
/*
* NOTE! The in-memory inode i_data array is in little-endian order
* even on big-endian machines: we do NOT byteswap the block numbers!
diff -urNp linux-2.6.4/fs/ext3/super.c 264-rsv/fs/ext3/super.c
--- linux-2.6.4/fs/ext3/super.c 2004-03-10 18:55:44.000000000 -0800
+++ 264-rsv/fs/ext3/super.c 2004-03-29 17:22:18.539077360 -0800
@@ -1291,6 +1291,13 @@ static int ext3_fill_super (struct super
sbi->s_gdb_count = db_count;
get_random_bytes(&sbi->s_next_generation, sizeof(u32));
spin_lock_init(&sbi->s_next_gen_lock);
+ /* per fileystem reservation list head & lock */
+ spin_lock_init(&sbi->s_rsv_window_lock);
+ INIT_LIST_HEAD(&sbi->s_rsv_window_head.rsv_list);
+ sbi->s_rsv_window_head.rsv_start = 0;
+ sbi->s_rsv_window_head.rsv_end = 0;
+ sbi->s_rsv_window_head.rsv_goal_size = 0;
+
/*
* set up enough so that it can read an inode
*/
diff -urNp linux-2.6.4/include/linux/ext3_fs.h 264-rsv/include/linux/ext3_fs.h
--- linux-2.6.4/include/linux/ext3_fs.h 2004-03-10 18:55:33.000000000 -0800
+++ 264-rsv/include/linux/ext3_fs.h 2004-04-02 22:26:23.335260792 -0800
@@ -37,6 +37,8 @@ struct statfs;
*/
#undef EXT3_PREALLOCATE /* @@@ Fix this! */
#define EXT3_DEFAULT_PREALLOC_BLOCKS 8
+#define EXT3_RESERVATION
+#define EXT3_DEFAULT_RESERVE_BLOCKS 8

/*
* Always enable hashed directories
@@ -728,6 +730,7 @@ extern void ext3_put_inode (struct inode
extern void ext3_delete_inode (struct inode *);
extern int ext3_sync_inode (handle_t *, struct inode *);
extern void ext3_discard_prealloc (struct inode *);
+extern void ext3_discard_reservation (struct inode *);
extern void ext3_dirty_inode(struct inode *);
extern int ext3_change_inode_journal_flag(struct inode *, int);
extern void ext3_truncate (struct inode *);
diff -urNp linux-2.6.4/include/linux/ext3_fs_i.h 264-rsv/include/linux/ext3_fs_i.h
--- linux-2.6.4/include/linux/ext3_fs_i.h 2004-03-10 18:55:21.000000000 -0800
+++ 264-rsv/include/linux/ext3_fs_i.h 2004-03-30 18:39:30.556107248 -0800
@@ -18,8 +18,15 @@

#include <linux/rwsem.h>

+struct reserve_window{
+ struct list_head rsv_list;
+ __u32 rsv_start;
+ __u32 rsv_end;
+ unsigned short rsv_goal_size;
+};
+
/*
- * second extended file system inode data in memory
+ * third extended file system inode data in memory
*/
struct ext3_inode_info {
__u32 i_data[15];
@@ -61,6 +68,9 @@ struct ext3_inode_info {
__u32 i_prealloc_block;
__u32 i_prealloc_count;
#endif
+ /* block reservation window */
+ struct reserve_window i_rsv_window;
+
__u32 i_dir_start_lookup;
#ifdef CONFIG_EXT3_FS_XATTR
/*
diff -urNp linux-2.6.4/include/linux/ext3_fs_sb.h 264-rsv/include/linux/ext3_fs_sb.h
--- linux-2.6.4/include/linux/ext3_fs_sb.h 2004-03-10 18:55:44.000000000 -0800
+++ 264-rsv/include/linux/ext3_fs_sb.h 2004-03-29 17:22:18.544076600 -0800
@@ -59,6 +59,10 @@ struct ext3_sb_info {
struct percpu_counter s_dirs_counter;
struct blockgroup_lock s_blockgroup_lock;

+ /* head of the per fs reservation window tree */
+ spinlock_t s_rsv_window_lock;
+ struct reserve_window s_rsv_window_head;
+
/* Journaling */
struct inode * s_journal_inode;
struct journal_s * s_journal;

Attachment: Throughput.png
Description: PNG image