Re: [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}()

From: Liam R. Howlett
Date: Mon Jul 31 2023 - 12:09:55 EST


* Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230731 05:53]:
>
>
> 在 2023/7/26 22:58, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230726 04:10]:
> > > Introduce ma_nonleaf_data_end{_nocheck}() to get the data end of
> > > non-leaf nodes without knowing the maximum value of nodes, so that any
> > > ascending can be avoided even if the maximum value of nodes is not known.
> > >
> > > The principle is that we introduce MAPLE_ENODE to mark an ENODE, which
> > > cannot be used by metadata, so we can distinguish whether it is ENODE or
> > > metadata.
> > >
> > > The nocheck version is to avoid lockdep complaining in some scenarios
> > > where no locks are held.
> > >
> > > Signed-off-by: Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx>
> > > ---
> > > lib/maple_tree.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++--
> > > 1 file changed, 68 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > index a3d602cfd030..98e4fdf6f4b9 100644
> > > --- a/lib/maple_tree.c
> > > +++ b/lib/maple_tree.c
> > > @@ -310,12 +310,19 @@ static inline void mte_set_node_dead(struct maple_enode *mn)
> > > #define MAPLE_ENODE_TYPE_SHIFT 0x03
> > > /* Bit 2 means a NULL somewhere below */
> > > #define MAPLE_ENODE_NULL 0x04
> > > +/* Bit 7 means this is an ENODE, instead of metadata */
> > > +#define MAPLE_ENODE 0x80
> >
> > We were saving this bit for more node types. I don't want to use this
> > bit for this reason since you could have done BFS to duplicate the tree
> > using the existing way to find the node end.
> We have reserved 4 bits for the node type. I don't think there will be
> more than 16 node types going forward.

We aren't using this bit for the metadata validation. We have plans for
the tree to be used elsewhere and I'd like to keep this bit in case we
need it.

>
> Even the DFS that has been implemented can use the existing way to get
> the data end.

Please try using it and see the impact on your performance testing.
Think about how often a node is full and how often a parent node is
full, etc, etc. As I brought up during the review of your testing code,
you actually have not filled the trees during construction. So there
will not be a walk up more than one, and I'm not even sure you will get
one walk up.

The nodes are also going to be in CPU cache, so even when you do need to
walk up multiple times, it's not going to be a huge performance
consideration.

Furthermore, the maple allocation nodes have metadata stored outside of
the data - there is extra space that we are using in those nodes always.
So for fork, all this is to make walking up to the parent of full leaves
faster - but it's not used for leaves so there is no benefit in that
case.

>I didn't use it because when walking up the tree, we don't
> know the maximum value of the node, and the continuous upward walk will
> introduce more overhead, which is what mas_ascend() does. Doing BFS
> cannot avoid this problem also.
>
> The reason I don't do BFS is that it has more overhead than DFS. If you
> think of a tree as a graph, doing DFS will only walk each edge twice.
> What if it is BFS? Since we can't use queues, we can only emulate BFS,
> which additionally does something like mas_next_node() does, which
> introduces more overhead than DFS. Considering only the layer of leaf
> nodes, it needs to walk each edge twice. So the overhead of doing BFS is
> more than DFS.

Okay, thanks.

> >
> > Bits are highly valuable and this is the only remaining bit. I had
> > thought about using this in Feb 2021 to see if there was metadata or
> > not, but figured a way around it (using the max trick) and thus saved
> > this bit for potential expansion of node types.
> I thought of another way to get the maximum value of a node without
> doing any extra upward walk. When doing DFS, we can use a stack to save
> the maximum value of ancestor nodes. The stack size can be set to
> MAPLE_HEIGHT_MAX. In this way, this bit can be reserved, and there is no
> need to do a loop like mas_ascend() in order to get the maximum value.

You want an array of 31 unsigned longs to keep track of the max of this
node? Please don't do that.

> >
> > > +
> > > +static inline bool slot_is_mte(unsigned long slot)
> > > +{
> > > + return slot & MAPLE_ENODE;
> > > +}
> > > static inline struct maple_enode *mt_mk_node(const struct maple_node *node,
> > > enum maple_type type)
> > > {
> > > - return (void *)((unsigned long)node |
> > > - (type << MAPLE_ENODE_TYPE_SHIFT) | MAPLE_ENODE_NULL);
> > > + return (void *)((unsigned long)node | (type << MAPLE_ENODE_TYPE_SHIFT) |
> > > + MAPLE_ENODE_NULL | MAPLE_ENODE);
> > > }
> > > static inline void *mte_mk_root(const struct maple_enode *node)
> > > @@ -1411,6 +1418,65 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
> > > return NULL;
> > > }
> > > +/*
> > > + * ma_nonleaf_data_end() - Find the end of the data in a non-leaf node.
> > > + * @mt: The maple tree
> > > + * @node: The maple node
> > > + * @type: The maple node type
> > > + *
> > > + * Uses metadata to find the end of the data when possible without knowing the
> > > + * node maximum.
> > > + *
> > > + * Return: The zero indexed last slot with child.
> > > + */
> > > +static inline unsigned char ma_nonleaf_data_end(struct maple_tree *mt,
> > > + struct maple_node *node,
> > > + enum maple_type type)
> > > +{
> > > + void __rcu **slots;
> > > + unsigned long slot;
> > > +
> > > + slots = ma_slots(node, type);
> > > + slot = (unsigned long)mt_slot(mt, slots, mt_pivots[type]);
> > > + if (unlikely(slot_is_mte(slot)))
> > > + return mt_pivots[type];
> > > +
> > > + return ma_meta_end(node, type);
> > > +}
> > > +
> > > +/*
> > > + * ma_nonleaf_data_end_nocheck() - Find the end of the data in a non-leaf node.
> > > + * @node: The maple node
> > > + * @type: The maple node type
> > > + *
> > > + * Uses metadata to find the end of the data when possible without knowing the
> > > + * node maximum. This is the version of ma_nonleaf_data_end() that does not
> > > + * check for lock held. This particular version is designed to avoid lockdep
> > > + * complaining in some scenarios.
> > > + *
> > > + * Return: The zero indexed last slot with child.
> > > + */
> > > +static inline unsigned char ma_nonleaf_data_end_nocheck(struct maple_node *node,
> > > + enum maple_type type)
> > > +{
> > > + void __rcu **slots;
> > > + unsigned long slot;
> > > +
> > > + slots = ma_slots(node, type);
> > > + slot = (unsigned long)rcu_dereference_raw(slots[mt_pivots[type]]);
> > > + if (unlikely(slot_is_mte(slot)))
> > > + return mt_pivots[type];
> > > +
> > > + return ma_meta_end(node, type);
> > > +}
> > > +
> > > +/* See ma_nonleaf_data_end() */
> > > +static inline unsigned char mte_nonleaf_data_end(struct maple_tree *mt,
> > > + struct maple_enode *enode)
> > > +{
> > > + return ma_nonleaf_data_end(mt, mte_to_node(enode), mte_node_type(enode));
> > > +}
> > > +
> > > /*
> > > * ma_data_end() - Find the end of the data in a node.
> > > * @node: The maple node
> > > --
> > > 2.20.1
> > >
> > >