Re: [PATCH] ext4: dir inode reservation V3

From: Coly Li
Date: Tue Nov 20 2007 - 11:32:18 EST


Jan,

Thanks for taking time to review the patch :-)

Jan Kara wrote:
> Hi Coly,
>
> finally I've found some time to have a look at a new version of your
> patch.
>
>> 5, Performance number
>> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
>> partition, and tried to create 50000 directories, and create 15 (1KB)
>> files in each directory alternatively. After a remount, I tried to
>> remove all the directories and files recursively by a 'rm -rf'. Bellow
>> is the benchmark result,
>> normal ext4 ext4 with dir inode reservation
>> mount options: -o data=writeback -o data=writeback,dir_ireserve=low
>> Create dirs: real 0m49.101s real 2m59.703s
>> Create files: real 24m17.962s real 21m8.161s
>> Unlink all: real 24m43.788s real 17m29.862s
>> Creating dirs with dir inode reservation is slower than normal ext4 as
>> predicted, because allocating directory inodes in non-linear order
>> will cause extra hard disk seeking and block I/O. Creating files with
>> dir inode reservation is 13% faster than normal ext4. Unlink all the
>> directories and files is 29.2% faster as expected. When number of
>> directories is increased, the performance improvement will be more
>> considerable. More benchmark result will be posted here if necessary,
>> because I need more time to run more test cases.
> Maybe to get some better idea - could you compare how long does take
> untar of a kernel tree, find through the whole kernel tree and removal
> of the tree? Also measuring CPU load during your benchmarks would be
> useful so that we can see whether we don't increase too much the cost of
> searching in bitmaps...

Sure, good ideas, I will add these items to my benchmark list.

>
> The patch is nicely short ;)
Thanks for the encouragement.
The first version was more than 2000 lines, including kernel and e2fsprogs patches. Finally I find
only around 100 lines kernel patch can achieve same performance too. Really nice :-)

>
>> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
>> index 17b5df1..f838a72 100644
>> --- a/fs/ext4/ialloc.c
>> +++ b/fs/ext4/ialloc.c
>> @@ -10,6 +10,8 @@
>> * Stephen Tweedie (sct@xxxxxxxxxx), 1993
>> * Big-endian to little-endian byte-swapping/bitmaps by
>> * David S. Miller (davem@xxxxxxxxxxxxxxxx), 1995
>> + * Directory inodes reservation by
>> + * Coly Li (coyli@xxxxxxx), 2007
>> */
>>
>> #include <linux/time.h>
>> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
>> return -1;
>> }
>>
>> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
>> + int mode, ext4_group_t *group, unsigned long *ino)
>> +{
>> + struct super_block *sb;
>> + struct ext4_sb_info *sbi;
>> + struct ext4_group_desc *gdp = NULL;
>> + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
>> + ext4_group_t ires_group = *group;
>> + unsigned long ires_ino;
>> + int i, bit;
>> +
>> + sb = dir->i_sb;
>> + sbi = EXT4_SB(sb);
>> +
>> + /* if the inode number is not for directory,
>> + * only try to allocate after directory's inode
>> + */
>> + if (!S_ISDIR(mode)) {
>> + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
>> + return 0;
>> + }
> ^^^ You don't set a group here - is this intentional? It means we get
> the group from find_group_other() and there we start searching at a
> place corresponding to parent's inode number... That would be an
> interesting heuristic but I'm not sure if that's what you want.

Yes, if allocating a file inode, just return first inode offset in the reserved area of parent
directory. In this case, group is decided by find_group_other() or find_group_orlov(),
ext4_ino_from_ireserve() just tries to persuade linear inode allocator to search free inode slot
after parent's inode.

>
>> +
>> + /* reserve inodes for new directory */
>> + for (i = 0; i < sbi->s_groups_count; i++) {
>> + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
>> + if (!gdp)
>> + goto fail;
>> + bit = 0;
>> +try_same_group:
>> + if (bit < EXT4_INODES_PER_GROUP(sb)) {
>> + brelse(bitmap_bh);
>> + bitmap_bh = read_inode_bitmap(sb, ires_group);
>> + if (!bitmap_bh)
>> + goto fail;
>> +
>> + BUFFER_TRACE(bitmap_bh, "get_write_access");
>> + if (ext4_journal_get_write_access(
>> + handle, bitmap_bh) != 0)
>> + goto fail;
>> + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
>> + bit, bitmap_bh->b_data)) {
>> + /* we won it */
>> + BUFFER_TRACE(bitmap_bh,
>> + "call ext4_journal_dirty_metadata");
>> + if (ext4_journal_dirty_metadata(handle,
>> + bitmap_bh) != 0)
>> + goto fail;
>> + ires_ino = bit;
>> + goto find;
>> + }
>> + /* we lost it */
>> + jbd2_journal_release_buffer(handle, bitmap_bh);
>> + bit += sbi->s_dir_ireserve_nr;
>> + goto try_same_group;
>> + }
> So this above is just a while loop coded with goto... While loop
> would be IMO better.

The only reason for me to use a goto, is 80 column limitation :) BTW, goto does not hurt performance
and readability here. IMHO, it's acceptable :-)

>
> Honza

Today I finished V4 patch, which fixs a bug and provides new mount option to configure reserved
inodes number. It can be faster on creating empty directories, I will release it once I finished
basic benchmark. Your advice on benchmark is valuable, I will post the benchmark result next week.

Best regards.

--
Coly Li
SuSE PRC Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/