Re: [PATCH bpf-next v2] docs/bpf: Add LRU internals description and graph

From: Bagas Sanjaya
Date: Fri Nov 04 2022 - 05:31:13 EST


On 11/4/22 03:50, Joe Stringer wrote:
> +An LRU hashmap type consists of two properties: Firstly, it is a hash map and
> +hence is indexable by key for constant time lookups. Secondly, when at map
> +capacity, map updates will trigger eviction of old entries based on the age of
> +the elements in a set of lists. Each of these properties may be either global
> +or per-CPU, depending on the map type and flags used to create the map:
> +
> +.. flat-table:: Comparison of map properties by map type (x-axis) and flags
> + (y-axis)
> +
> + * -
> + - ``BPF_MAP_TYPE_LRU_HASH``
> + - ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
> +
> + * - ``BPF_NO_COMMON_LRU``
> + - Per-CPU LRU, global map
> + - Per-CPU LRU, per-cpu map
> +
> + * - ``!BPF_NO_COMMON_LRU``
> + - Global LRU, global map
> + - Global LRU, per-cpu map
> +

Shouldn't the table be written in reST table syntax instead?

> +The commit message for LRU map support provides a general overview of the
> +underlying LRU algorithm used for entry eviction when the table is full:
> +
> +::
> +
> + commit 3a08c2fd763450a927d1130de078d6f9e74944fb
> + Author: Martin KaFai Lau <kafai@xxxxxx>
> + Date: Fri Nov 11 10:55:06 2016 -0800
> +
> + bpf: LRU List
> +
> + Introduce bpf_lru_list which will provide LRU capability to
> + the bpf_htab in the later patch.
> +
> + * General Thoughts:
> + 1. Target use case. Read is more often than update.
> + (i.e. bpf_lookup_elem() is more often than bpf_update_elem()).
> + If bpf_prog does a bpf_lookup_elem() first and then an in-place
> + update, it still counts as a read operation to the LRU list concern.
> + 2. It may be useful to think of it as a LRU cache
> + 3. Optimize the read case
> + 3.1 No lock in read case
> + 3.2 The LRU maintenance is only done during bpf_update_elem()
> + 4. If there is a percpu LRU list, it will lose the system-wise LRU
> + property. A completely isolated percpu LRU list has the best
> + performance but the memory utilization is not ideal considering
> + the work load may be imbalance.
> + 5. Hence, this patch starts the LRU implementation with a global LRU
> + list with batched operations before accessing the global LRU list.
> + As a LRU cache, #read >> #update/#insert operations, it will work well.
> + 6. There is a local list (for each cpu) which is named
> + 'struct bpf_lru_locallist'. This local list is not used to sort
> + the LRU property. Instead, the local list is to batch enough
> + operations before acquiring the lock of the global LRU list. More
> + details on this later.
> + 7. In the later patch, it allows a percpu LRU list by specifying a
> + map-attribute for scalability reason and for use cases that need to
> + prepare for the worst (and pathological) case like DoS attack.
> + The percpu LRU list is completely isolated from each other and the
> + LRU nodes (including free nodes) cannot be moved across the list. The
> + following description is for the global LRU list but mostly applicable
> + to the percpu LRU list also.
> +
> + * Global LRU List:
> + 1. It has three sub-lists: active-list, inactive-list and free-list.
> + 2. The two list idea, active and inactive, is borrowed from the
> + page cache.
> + 3. All nodes are pre-allocated and all sit at the free-list (of the
> + global LRU list) at the beginning. The pre-allocation reasoning
> + is similar to the existing BPF_MAP_TYPE_HASH. However,
> + opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in
> + the LRU map.
> +
> + * Active/Inactive List (of the global LRU list):
> + 1. The active list, as its name says it, maintains the active set of
> + the nodes. We can think of it as the working set or more frequently
> + accessed nodes. The access frequency is approximated by a ref-bit.
> + The ref-bit is set during the bpf_lookup_elem().
> + 2. The inactive list, as its name also says it, maintains a less
> + active set of nodes. They are the candidates to be removed
> + from the bpf_htab when we are running out of free nodes.
> + 3. The ordering of these two lists is acting as a rough clock.
> + The tail of the inactive list is the older nodes and
> + should be released first if the bpf_htab needs free element.
> +
> + * Rotating the Active/Inactive List (of the global LRU list):
> + 1. It is the basic operation to maintain the LRU property of
> + the global list.
> + 2. The active list is only rotated when the inactive list is running
> + low. This idea is similar to the current page cache.
> + Inactive running low is currently defined as
> + "# of inactive < # of active".
> + 3. The active list rotation always starts from the tail. It moves
> + node without ref-bit set to the head of the inactive list.
> + It moves node with ref-bit set back to the head of the active
> + list and then clears its ref-bit.
> + 4. The inactive rotation is pretty simply.
> + It walks the inactive list and moves the nodes back to the head of
> + active list if its ref-bit is set. The ref-bit is cleared after moving
> + to the active list.
> + If the node does not have ref-bit set, it just leave it as it is
> + because it is already in the inactive list.
> +
> + * Shrinking the Inactive List (of the global LRU list):
> + 1. Shrinking is the operation to get free nodes when the bpf_htab is
> + full.
> + 2. It usually only shrinks the inactive list to get free nodes.
> + 3. During shrinking, it will walk the inactive list from the tail,
> + delete the nodes without ref-bit set from bpf_htab.
> + 4. If no free node found after step (3), it will forcefully get
> + one node from the tail of inactive or active list. Forcefully is
> + in the sense that it ignores the ref-bit.
> +
> + * Local List:
> + 1. Each CPU has a 'struct bpf_lru_locallist'. The purpose is to
> + batch enough operations before acquiring the lock of the
> + global LRU.
> + 2. A local list has two sub-lists, free-list and pending-list.
> + 3. During bpf_update_elem(), it will try to get from the free-list
> + of (the current CPU local list).
> + 4. If the local free-list is empty, it will acquire from the
> + global LRU list. The global LRU list can either satisfy it
> + by its global free-list or by shrinking the global inactive
> + list. Since we have acquired the global LRU list lock,
> + it will try to get at most LOCAL_FREE_TARGET elements
> + to the local free list.
> + 5. When a new element is added to the bpf_htab, it will
> + first sit at the pending-list (of the local list) first.
> + The pending-list will be flushed to the global LRU list
> + when it needs to acquire free nodes from the global list
> + next time.
> +
> + * Lock Consideration:
> + The LRU list has a lock (lru_lock). Each bucket of htab has a
> + lock (buck_lock). If both locks need to be acquired together,
> + the lock order is always lru_lock -> buck_lock and this only
> + happens in the bpf_lru_list.c logic.
> +
> + In hashtab.c, both locks are not acquired together (i.e. one
> + lock is always released first before acquiring another lock).
> +
> + Signed-off-by: Martin KaFai Lau <kafai@xxxxxx>
> + Acked-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> + Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
> +

What about just writing the pointer ("See commit 3a08c2fd7634 ("bpf: LRU List")")
instead?

> +Notably, there are various steps that the update algorithm attempts in order to
> +enforce the LRU property which have increasing impacts on other CPUs involved
> +in the operations:
> +
> +- Attempt to use CPU-local state to batch operations
> +- Attempt to fetch free nodes from global lists
> +- Attempt to pull any node from a global list and remove it from the hashmap
> +- Attempt to pull any node from any CPU's list and remove it from the hashmap
> +

Better say "... other CPUs involved in the following operation attempts:"

> +Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
> +may fail to insert the entry into the map if other CPUs are heavily contending
> +on the global hashmap lock.
> +
> +This algorithm is described visually in the following diagram:
> +
> +.. kernel-figure:: map_lru_hash_update.dot
> + :alt: Diagram outlining the LRU eviction steps taken during map update
> +
> + LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and
> + variants
> +
<snipped>...
> +
> +The dot file source for the above diagram is uses internal kernel function
> +names for the node names in order to make the corresponding logic easier to
> +find. See ``Documentation/bpf/map_lru_hash_update.dot`` for more details.

Since it references the same figure, just say "See the figure above for more
details".

Otherwise LGTM, thanks.

--
An old man doll... just what I always wanted! - Clara