RE: [RFC PATCH] f2fs: add extent cache base on rb-tree

From: Chao Yu
Date: Mon Jan 12 2015 - 02:05:10 EST


Hi Jaegeuk,

> -----Original Message-----
> From: Jaegeuk Kim [mailto:jaegeuk@xxxxxxxxxx]
> Sent: Wednesday, January 07, 2015 7:01 AM
> To: Chao Yu
> Cc: 'Changman Lee'; linux-f2fs-devel@xxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [RFC PATCH] f2fs: add extent cache base on rb-tree
>
> Hi Chao,
>
> On Sun, Jan 04, 2015 at 04:24:10PM +0800, Chao Yu wrote:
> > Hi Jaegeuk,
> >
> > > -----Original Message-----
> > > From: Jaegeuk Kim [mailto:jaegeuk@xxxxxxxxxx]
> > > Sent: Wednesday, December 31, 2014 4:26 PM
> > > To: Chao Yu
> > > Cc: 'Changman Lee'; linux-f2fs-devel@xxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> > > Subject: Re: [RFC PATCH] f2fs: add extent cache base on rb-tree
> > >
> > > Hi Chao,
> > >
> > > On Tue, Dec 30, 2014 at 06:10:21PM +0800, Chao Yu wrote:
> > > > Hi Jaegeuk,
> > > >
> > > > > -----Original Message-----
> > > > > From: Jaegeuk Kim [mailto:jaegeuk@xxxxxxxxxx]
> > > > > Sent: Tuesday, December 30, 2014 5:23 AM
> > > > > To: Chao Yu
> > > > > Cc: 'Changman Lee'; linux-f2fs-devel@xxxxxxxxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx
> > > > > Subject: Re: [RFC PATCH] f2fs: add extent cache base on rb-tree
> > > > >
> > > > > Hi Chao,
> > > > >
> > > > > On Mon, Dec 29, 2014 at 03:19:18PM +0800, Chao Yu wrote:
> > > > >
> > > > > [snip]
> > > > >
> > > > > Nice draft. :)
> > > >
> > > > Thanks! :)
> > > >
> > > > >
> > > > > >
> > > > > > Please see the draft below.
> > > > > >
> > > > > > 1) Extent management:
> > > > > > If we use global management, i.e. managing all extents from different inodes
> > > > > > in sbi, we will face serious lock contention when extents belonging to
> > > > > > different inodes are accessed concurrently; the loss may outweigh the gain.
> > > > >
> > > > > Agreed.
> > > > >
> > > > > > So we choose local management for extents, which means all extents are
> > > > > > managed by the inode itself, to avoid the above lock contention. Additionally,
> > > > > > we manage all extents globally by linking all inodes into a global lru list
> > > > > > for the extent cache shrinker.
> > > > > > Approach:
> > > > > > a) build extent tree/rwlock/lru list/extent count in each inode.
> > > > > >    *extent tree: links all extents of the inode in an rb-tree;
> > > > > >    *rwlock: protects these fields when the extent cache is accessed concurrently;
> > > > > >    *lru list: sorts all extents in access time order;
> > > > > >    *extent count: records the total number of extents in the cache.
> > > > > > b) use an lru shrink list in sbi to manage all inodes which have cached extents.
> > > > > >    *an inode is added to or repositioned in this global list whenever one of
> > > > > >     its extents is accessed;
> > > > > >    *a spinlock protects this shrink list.
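
Just to make the draft concrete, the per-inode fields and the global shrink list
described above would roughly look like the sketch below. Structure and field names
are only illustrative, not what a final patch would use.

struct extent_info {
	unsigned int fofs;		/* start file offset */
	u32 blkaddr;			/* start block address */
	unsigned int len;		/* number of blocks */
};

struct extent_node {
	struct rb_node rb_node;		/* node in the inode's extent rb-tree */
	struct list_head list;		/* node in the inode's lru list */
	struct extent_info ei;		/* the cached mapping itself */
};

/* per-inode extent cache (illustrative; would live in struct f2fs_inode_info) */
struct extent_tree {
	struct rb_root root;		/* all cached extents of this inode, by fofs */
	rwlock_t lock;			/* protects root, lru and count */
	struct list_head lru;		/* extents ordered by access time */
	unsigned int count;		/* number of cached extents */
};

/* added to struct f2fs_sb_info: inodes that have cached extents, for the shrinker */
struct list_head inode_shrink_list;	/* protected by inode_shrink_lock */
spinlock_t inode_shrink_lock;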
> > > > >
> > > > > 1. How about adding a data structure with inode number instead of referring
> > > > > inode pointer?
> > > > >
> > > > > 2. How about managing extent entries globally and setting an upper bound to
> > > > > the number of extent entries instead of limiting them per each inode?
> > > > > (The rb-tree will handle many extents per inode.)
> > > >
> > > > [assumption]
> > > > Approach A: global lru list;
> > > > Approach B: lru list per inode.
> > > >
> > > > I considered Approach A before writing the draft. Although Approach A could give
> > > > us a higher hit ratio due to the global lru, it may make lock contention more
> > > > intense and has more memory overhead, as described below. So I chose the more
> > > > conservative Approach B.
> > > > 1) as an extent_entry will be split on CoW, we may lock extent_list more times
> > > > when moving these separated extent entries in extent_list, unless we have a
> > > > batch mode interface.
> > > > 2) lock overhead between the shrinker, lookup and update paths.
> > > > e.g. extent_list: (LRU) A -> B -> C -> D -> E -> F -> G (MRU)
> > > > (#1) has A, C, E, G in rb-tree
> > > > (#2) has B, D, F in rb-tree
> > > > shrinker op: lock list -> trylock #1 -> unlock list -> free A -> unlock #1
> > > >              lock list -> trylock #2 -> unlock list -> free B -> unlock #2
> > > >              lock list -> trylock #1 -> unlock list -> free C -> unlock #1
> > > >              ...
> > >
> > > lock_list -> list_del(A) -> unlock_list -> lock #1 -> remove_tree(A) -> unlock #1
> > > -> free A.
> >
> > If we unlock_list before lock #1, (A) could be modified by another thread between
> > unlock_list and lock #1, so one more "lookup_tree(A)" is needed under lock #1:
> > lock #1 -> lookup_tree(A) -> remove_tree(A) -> unlock #1
> >
> > >
> > > Or,
> > > lock_list -> list_del(A, B, C) -> unlock_list -> lock #1 -> remove_tree(A, C) ->
> > > unlock #1 -> free A, C -> lock #2 -> remove_tree(B) -> unlock #2
> >
> > ditto.
> > To avoid the unneeded lookup_tree(A), maybe we can use this sequence:
> >
> > lock list -> get A from head -> trylock #1 -> try to get more nodes (C, E, G) of #1 in the list
> > -> unlock list -> remove_tree & free (A, C, E, G) -> unlock #1
> >
> > >
> > > I think a couple of list operations do not cause severe lock contention.
> >
> > Yeah, batch operation is better.
> > But sometimes random writes scatter one inode's extents all over the shrinker's list,
> > so batch operation gains less in that condition.
> >
> > > And, if we assign a minimum length for cached extents, the contention would be
> > > distributed.
> > >
> > > This means that every extent kept in the cache should have a length larger than
> > > the minimum value.
> >
> > If I do not misunderstand, what you mean is: if the number of extents in one inode's
> > cache is bigger than the minimum value, then we will start to add all extents of this
> > inode into the shrinker's list, and reposition them in the list on
> > f2fs_{lookup,update}_extent_cache?
> >
> > >
> > > > *trylock/unlock* for freeing all extent entries belonging to one inode will be
> > > > split into many separate parts, resulting in more contention.
> > > > 3) it has more memory overhead in each extent entry:
> > > >    Approach A: needs to add the inode number for a backref and a list_head *
> > >
> > > It doesn't need to add ino. Just leave them, and the shrinker can remove empty
> > > ino_entry_for_extents periodically.
> >
> > I do not understand why we don't need ino. :(
> > Without ino, how can we get the rb-tree root pointer for rb_erase(node, root)?
>
> Let me describe in more detail.
>
> For example,
>
> - global radix tree in sbi
>     --> ino #1 --> rb_tree  (rb_tree, rb_tree_lock, refcount=0)
>     --> ino #2 --> rb_tree
>     --> ino #3 --> rb_tree
>                      `-> extent A
>                      `-> extent B  (fofs, blkaddr, length, *list_head=empty)
>
> - global extent list in sbi
>
>   extent_list: (LRU) A -> B -> C -> D -> E -> F -> G (MRU)
>   (#1) has A, C, E, G in its rb-tree
>   (#2) has B, D, F in its rb-tree
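
If I understand the structures correctly, they would roughly look like the sketch
below. Structure and field names are only illustrative, and the kernel rbtree,
radix-tree, list and spinlock facilities are assumed:

struct extent_info {
	unsigned int fofs;		/* start file offset */
	u32 blkaddr;			/* start block address */
	unsigned int len;		/* number of blocks */
};

struct extent_node {
	struct rb_node rb_node;		/* node in extent_tree->root */
	struct list_head list;		/* node in sbi->extent_list; empty when unlinked */
	struct extent_info ei;		/* (fofs, blkaddr, length) */
};

/* one entry of the global radix tree, i.e. the per-inode rb-tree */
struct extent_tree {
	nid_t ino;			/* inode number, also the radix tree index */
	struct rb_root root;		/* rb-tree of extent_node, indexed by fofs */
	rwlock_t lock;			/* the rb_tree_lock above */
	atomic_t refcount;		/* pins this entry while it is being used */
};

/* added to struct f2fs_sb_info */
struct radix_tree_root extent_tree_root;	/* ino --> extent_tree */
spinlock_t extent_tree_lock;			/* protects the radix tree */
struct list_head extent_list;			/* global LRU of extent_node */
spinlock_t extent_lock;				/* protects extent_list */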
>
> 1) update extents (#1)
>
>  - lock_radix_tree
>      if (notfound #1)
>          allocate #1;
>      #1's refcount++;
>  - unlock_radix_tree
>
>  - lock_rb_tree (#1)
>      update A
>      rb_delete E, if E->length is smaller than MIN_LEN.
>      allocate & rb_add K, if K->length is larger than MIN_LEN.
>
>      list_lock
>          list_move(A)
>          list_del(E)
>          list_add_tail(K)
>      list_unlock
>  - unlock_rb_tree (#1)
>
>  - #1's refcount--;
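
So the update path would be something like the following sketch.
__update_extent_tree() is a made-up helper standing in for the rb-tree merge/split
logic, and error handling is omitted:

void f2fs_update_extent_cache(struct f2fs_sb_info *sbi, nid_t ino,
			      struct extent_info *new_ei)
{
	struct extent_tree *et;
	struct extent_node *en;

	/* 1) pin (or create) the per-inode tree under the radix-tree lock */
	spin_lock(&sbi->extent_tree_lock);
	et = radix_tree_lookup(&sbi->extent_tree_root, ino);
	if (!et) {
		et = kzalloc(sizeof(*et), GFP_ATOMIC);
		et->ino = ino;
		et->root = RB_ROOT;
		rwlock_init(&et->lock);
		atomic_set(&et->refcount, 0);
		radix_tree_insert(&sbi->extent_tree_root, ino, et);
	}
	atomic_inc(&et->refcount);
	spin_unlock(&sbi->extent_tree_lock);

	/* 2) update the rb-tree: merge/split around new_ei, dropping pieces
	 *    shorter than MIN_LEN; the made-up helper returns the node that
	 *    should move to the MRU end, or NULL.  Nodes dropped inside the
	 *    helper would also need list_del from sbi->extent_list (not shown). */
	write_lock(&et->lock);
	en = __update_extent_tree(et, new_ei);

	/* 3) keep the global LRU in sync under its own spinlock */
	if (en) {
		spin_lock(&sbi->extent_lock);
		if (list_empty(&en->list))
			list_add_tail(&en->list, &sbi->extent_list);
		else
			list_move_tail(&en->list, &sbi->extent_list);
		spin_unlock(&sbi->extent_lock);
	}
	write_unlock(&et->lock);

	/* 4) unpin; the shrinker frees the radix entry once refcount is 0
	 *    and its rb-tree is empty */
	atomic_dec(&et->refcount);
}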
>
> 2) shrinker
>
>  - list_lock
>      list_del A, B, C, ...
>  - list_unlock
>
>  - gang_look_up_radix_tree_with_lock {
>
>      #N's refcount++;
>
>      lock_rb_tree (#N)
>      for_each_rb_node {
>          if (list_empty(extent->list_head)) {
>              remove_rb_node(extent);
>              free(extent);
>          }
>      }
>      unlock_rb_tree (#N)
>
>      #N's refcount--;
>  - }
>
>  - for_each_radix_tree_with_lock {
>      if (radix_node->refcount == 0 && rb_tree_is_empty)
>          free(radix_node);
>  - }
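
And the shrink path, as I read it, would be roughly the sketch below.
EXT_TREE_VEC_SIZE and the open-coded rb-tree walk are only illustrative:

#define EXT_TREE_VEC_SIZE	64

void f2fs_shrink_extent_caches(struct f2fs_sb_info *sbi, unsigned long nr_to_shrink)
{
	struct extent_tree *trees[EXT_TREE_VEC_SIZE];
	struct extent_node *en;
	unsigned long ino = 0;
	unsigned int found, i;

	/* 1) unlink victims from the global LRU; they stay in their rb-trees,
	 *    but list_empty(&en->list) now marks them as reclaimable */
	spin_lock(&sbi->extent_lock);
	while (nr_to_shrink-- && !list_empty(&sbi->extent_list)) {
		en = list_first_entry(&sbi->extent_list, struct extent_node, list);
		list_del_init(&en->list);
	}
	spin_unlock(&sbi->extent_lock);

	/* 2) gang-lookup the radix tree and free every unlinked extent node */
	while (1) {
		spin_lock(&sbi->extent_tree_lock);
		found = radix_tree_gang_lookup(&sbi->extent_tree_root,
				(void **)trees, ino, EXT_TREE_VEC_SIZE);
		for (i = 0; i < found; i++)
			atomic_inc(&trees[i]->refcount);
		spin_unlock(&sbi->extent_tree_lock);

		if (!found)
			break;

		for (i = 0; i < found; i++) {
			struct extent_tree *et = trees[i];
			struct rb_node *node, *next;

			ino = et->ino + 1;

			write_lock(&et->lock);
			node = rb_first(&et->root);
			while (node) {
				next = rb_next(node);
				en = rb_entry(node, struct extent_node, rb_node);
				if (list_empty(&en->list)) {
					rb_erase(&en->rb_node, &et->root);
					kfree(en);
				}
				node = next;
			}
			write_unlock(&et->lock);

			atomic_dec(&et->refcount);
		}
	}

	/* 3) a final pass over the radix tree (under extent_tree_lock) frees
	 *    entries whose refcount is 0 and whose rb-tree is empty; not shown */
}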
>
> 3) lookup extents (#1)
>
>  - lock_radix_tree
>      if (notfound #1)
>          goto out;
>      #1's refcount++;
>  - unlock_radix_tree
>
>  - lock_rb_tree (#1)
>      found extent A
>
>      list_lock
>          list_move(A)
>      list_unlock
>  - unlock_rb_tree (#1)
>
>  - #1's refcount--;
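
Correspondingly, the lookup path would be something like this sketch, where
__lookup_extent_node() is a made-up helper doing the usual rb-tree search by fofs:

bool f2fs_lookup_extent_cache(struct f2fs_sb_info *sbi, nid_t ino,
			      unsigned int fofs, struct extent_info *ei)
{
	struct extent_tree *et;
	struct extent_node *en;
	bool found = false;

	/* pin the per-inode tree under the radix-tree lock */
	spin_lock(&sbi->extent_tree_lock);
	et = radix_tree_lookup(&sbi->extent_tree_root, ino);
	if (!et) {
		spin_unlock(&sbi->extent_tree_lock);
		return false;			/* no cached extents for this inode */
	}
	atomic_inc(&et->refcount);
	spin_unlock(&sbi->extent_tree_lock);

	read_lock(&et->lock);
	en = __lookup_extent_node(et, fofs);	/* made-up rb-tree search helper */
	if (en) {
		*ei = en->ei;			/* copy out fofs/blkaddr/len */

		/* move the hit extent to the MRU end of the global list */
		spin_lock(&sbi->extent_lock);
		if (!list_empty(&en->list))
			list_move_tail(&en->list, &sbi->extent_list);
		spin_unlock(&sbi->extent_lock);

		found = true;
	}
	read_unlock(&et->lock);

	atomic_dec(&et->refcount);
	return found;
}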

Really good! Thanks for your suggestion and the work.

The following RFC patch set is written based on the above detailed design.

Please review and comment on it, thank you! :)

Thanks
