Re: [PATCH] kretprobe scalability improvement

From: Masami Hiramatsu
Date: Tue Jul 06 2021 - 12:25:31 EST

Next message: Guenter Roeck: "[PATCH] riscv: Disable STACKPROTECTOR_PER_TASK if GCC_PLUGIN_RANDSTRUCT is enabled"
Previous message: Paul E. McKenney: "Re: [PATCH 2/2] rcu: Remove needless preemption disablement in rcu_all_qs()"
In reply to: Matt Wu: "Re: [PATCH] kretprobe scalability improvement"
Next in thread: Matt Wu: "Re: [PATCH] kretprobe scalability improvement"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, 6 Jul 2021 09:21:00 +0800
Matt Wu <wuqiang.matt@xxxxxxxxxxxxx> wrote:

> On 2021/7/5 PM2:59, Masami Hiramatsu wrote:
> > Hi,
> >
> > On Sat, 3 Jul 2021 18:28:18 +0800
> > "wuqiang.matt" <wuqiang.matt@xxxxxxxxxxxxx> wrote:
> >
> >> From: wuqiang <wuqiang.matt@xxxxxxxxxxxxx>
> >>
> >> The original freelist is a LIFO queue based on singly linked list, which lacks
> >> of scalability, and thus becomes bottleneck under high workloads. freelist was
> >> introduced by Masami Hiramatsu's work of removing kretprobe hash lock:
> >> url: https://lkml.org/lkml/2020/8/29/209.
> >>
> >> Here an array-based MPMC lockless queue is proposed. The solution of bounded
> >> array can nicely avoid ABA issue, while freelist or circular queue etc. have
> >> to perform 2 CAS loops. The other advantage is that order and fairness can be
> >> ignored, the only concern is to retrieve kretprobe instance records as fast
> >> as possible, i.e. performance and correctness. Tests of kretprobe on 96-CORE
> >> ARM64 show the biggest gain as 466.7x of the original freelist throughput.
> >> The raw queue throughput can be 1,975 times of freelist. Here are the results:
> >>
> >> Ubuntu 20.04, 5.13.0-rc6 (XEON E5-2660V3 2.4G, DDR4 2133MT/s, 10 CORES/20 THREADS):
> >> 1x 2x 4x 8x 10x 16x 20x 32x 40x
> >> freelist: 13086080 22493637 32773854 20129772 18455899 18435561 18980332 18988603 18991334
> >> array : 13144036 26059941 47449954 94517172 115856027 116414714 125692971 125553061 125685981
> >>
> >> Ubuntu 21.04 - 5.12.10 QEMU 96 CORES (HUAWEI TaiShan 2280V2 KP920 96 CORES 2.6G, DDR4 2944 MT/s):
> >> 1x 2x 4x 8x 16x 24x 48x 96x 192x
> >> freelist: 17,233,640 10,296,664 8,095,309 6,993,545 5,050,817 4,295,283 3,382,013 2,738,050 2,743,345
> >> array: 19,360,905 37,395,225 56,417,463 10,020,136 209,876,209 328,940,014 632,754,916 1,277,862,473 1,169,076,739
> >
> > Interesting result! How would you measure the overhead?
> > And also could you clarify the real scalability example of kretprobe usage ?
> > E.g. putting kretprobes at some function and profiling with perf. See following
> > slides for details.
> >
> > https://events.static.linuxfound.org/sites/events/files/slides/Handling%20the%20Massive%20Multiple%20Kprobes%20v2_1.pdf
> > (BTW, these efforts totally stalls a while, needs to be reviewed again)
>
> I did two kinds of tests: one is real kretprobe, the other is throughput
> comparison of different queue implementations.
>
> 1) kretprobe upon security_file_mprotect
>
> We found the performance bottleneck due to udp_recvmsg kretprobe in
> our production environment, then re-produced the issue with a lighter
> syscall: mprotect.
>
> "perf stat" is used to count number of sys_enter_mprotect calligs:
> perf stat -a -I 10000 -e 'syscalls:sys_enter_mprotect' vmstat 1 35
>
> The user mode program is just a loop of mprotect() to trigger the
> registered kretprobe callbacks. The codes are pushed to:
> https://github.com/mattwuq/kretprobe/blob/main/mprotect/
>
> I measured both kprobe and kretprobe for 4.14/5.9/5.12. The results
> of kprobe is really good, but kretprobe doesn't scale well (even for
> kernel 5.12 with "kprobes: Remove kretprobe hash").

Hmm, Ok if there is a real kretprobe issue (not freelist), it should
be solved. Could you also point this result from your changelog?

>
> 2) raw queue throughput benchmarks
>
> I wrote a module with dedicated kernel threads performing insertions
> and deletions of several freelist implementations for 10ms.
>
> The codes and test scripts are available at:
> https://github.com/mattwuq/kretprobe/blob/main/scalable/
>
> 1) fl.h: original freelist, LIFO queue based on singly linked list
> 2) ra.h: read from random position, write to last read pos
> 3) sa.h: array-based queue, per-cpu slot to be equally distributed
> 4) saca.h: the proposed version, allocating array with L1 cache line
> aligned for each core
> 5) saea.h: make every elelment cache_line aligned
> 6) zz.h: a.k.a zigzag, remap numerical order to L1 cache distance,
> for 64bit pointers, 0 to 0, 1 to 8, 2 to 16
> 7) cq.h: native circular queue, not used, can not handle reentrance
>
> Two types of tests are performanced:
> 1) throughput bench: with no delay between deletion and insertion
> 2) emulation bench of real kretprobe: 1us delay before inserting back
>
> All the results and charts are available at:
> https://github.com/mattwuq/kretprobe/tree/main/doc/
>

OK, this test report is also great :)

> >> So linear scalability is still not available, limited by the following two
> >> considerations:
> >>
> >> 1. keep it simple: best solution could be an implementation of per-cpu queues,
> >> but it's not applicable for this case due to complexity. After all for
> >> most cases the number of pre-allocated kretprobe instances (maxactive) is
> >> only a small value. If not specified by user during registering, maxactive
> >> is set as CPU cores or 2x when preemption is enabled
> >> 2. keep it compact: cache-line-alignment can solve false-sharing and minimize
> >> cache thrashing, but it introduces memory wasting, considering the small
> >> body of structure kretprobe_instance. Secondly the performance improvement
> >> of cache-line-aligned is not significant as expected
> >
> > If you really need the linear scalability, drop useless entry-handler and per
> > instance data (or just leave the data pointer) and allocate the instance pool
> > for each task struct. This is perfectly scalable, because kretprobe instance
> > is only for making a shadow stack for the task, not CPU.
>
> Yes, per-task list of kretprobe instances would deliver best throughput.
> But the penality could be high in memory efficency and implementations.

How much penalty it would make? If we allocate a 4kb pool for each task,
it would be enough small compared with other resources (and we may be
able to select the pool on-line or compile option)

> Inspired by your idea, I'm thinking of allocating from stack:
>
> 1) from stack top: need modify stack top limit, might “violate” the
> purpose of guard page
> 2）reserve from current stack: need modify trampolines of fltrace and
> kprobe, but there are many challenges.

No, I don't like this change because it will disturb the stack unwinder
and consuming the stack itself.

>
> >> With a pre-built kernel, further performance tuning can be done by increasing
> >> maxactive when registering kretprobe. Tests show 4x cores number is a fair
> >> choice for both performance and memory efficiency.
> >
> > Which test should I check? If it is also good for the current freelist,
> > I would like to expand default maxactive. (actually, current maxactive
> > is chosen by the minimum availability)
>
> I tested with difference maxactive values. For current freelist, bigger
> maxactive values have less effects upon performance.

So bigger 'maxactive' will scale better?

>
> "missed cases" was also tracked. Based on testings, so long as maxactive
> is more than cores number, there won't be "missed cases".

That depends on where you put the probe. kretprobe can be nested and
sleepable. If you put a kretprobe on the function which doesn't yield,
you don't need bigger maxactive. But kretprobe on the function which
can sleep or yield, you may need more than that.

> >>
> >> More info is available at: https://github.com/mattwuq/kretprobe
> >>
> >> Signed-off-by: wuqiang <wuqiang.matt@xxxxxxxxxxxxx>
> >> ---
> >> include/linux/freelist.h | 187 +++++++++++++++++++--------------------
> >> kernel/kprobes.c | 29 +++---
> >> 2 files changed, 107 insertions(+), 109 deletions(-)
> >>
> >> diff --git a/include/linux/freelist.h b/include/linux/freelist.h
> >> index fc1842b96469..3d4a0bc425b2 100644
> >> --- a/include/linux/freelist.h
> >> +++ b/include/linux/freelist.h
> >> @@ -1,129 +1,122 @@
> >> -/* SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause */
> >> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> >
> > Please do NOT change the license without the agreement of all copyright holders.
> > Or, add a new file and remove the current freelist.h.

What about this?

Thank you,

> >
> >> #ifndef FREELIST_H
> >> #define FREELIST_H
> >>
> >> +#include <linux/slab.h>
> >> #include <linux/atomic.h>
> >>
> >> /*
> >> - * Copyright: cameron@xxxxxxxxxxxxxx
> >> + * lockless queue for kretprobe instances
> >> *
> >> - * A simple CAS-based lock-free free list. Not the fastest thing in the world
> >> - * under heavy contention, but simple and correct (assuming nodes are never
> >> - * freed until after the free list is destroyed), and fairly speedy under low
> >> - * contention.
> >> - *
> >> - * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
> >> + * It's an array-based MPMC lockless queue, solely for better scalability
> >> + * and contention mitigation. It's simple in implementation and compact in
> >> + * memory efficiency. The only concern is to retrieve kretprobe instance
> >> + * records as fast as possible, with both order and fairness ignored.
> >> */
> >>
> >> struct freelist_node {
> >> - atomic_t refs;
> >> - struct freelist_node *next;
> >> + struct freelist_node *next;
> >> };
> >> -
> >> struct freelist_head {
> >> - struct freelist_node *head;
> >> + uint32_t fh_size; /* rounded to power of 2 */
> >> + uint32_t fh_mask; /* (fh_size - 1) */
> >> + uint16_t fh_bits; /* log2(fh_size) */
> >> + uint16_t fh_step; /* per-core shift stride */
> >> + uint32_t fh_used; /* num of elements in list */
> >> + struct freelist_node **fh_ents; /* array for krp instances */
> >> };
> >>
> >> -#define REFS_ON_FREELIST 0x80000000
> >> -#define REFS_MASK 0x7FFFFFFF
> >> +static inline int freelist_init(struct freelist_head *list, int max)
> >> +{
> >> + uint32_t size, cores = num_possible_cpus();
> >> +
> >> + list->fh_used = 0;
> >> + list->fh_step = ilog2(L1_CACHE_BYTES / sizeof(void *));
> >> + if (max < (cores << list->fh_step))
> >> + list->fh_size = roundup_pow_of_two(cores) << list->fh_step;
> >> + else
> >> + list->fh_size = roundup_pow_of_two(max);
> >> + list->fh_mask = list->fh_size - 1;
> >> + list->fh_bits = (uint16_t)ilog2(list->fh_size);
> >> + size = list->fh_size * sizeof(struct freelist_node *);
> >> + list->fh_ents = kzalloc(size, GFP_KERNEL);
> >> + if (!list->fh_ents)
> >> + return -ENOMEM;
> >> +
> >> + return 0;
> >> +}
> >>
> >> -static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
> >> +static inline int freelist_try_add(struct freelist_node *node, struct freelist_head *list)
> >> {
> >> - /*
> >> - * Since the refcount is zero, and nobody can increase it once it's
> >> - * zero (except us, and we run only one copy of this method per node at
> >> - * a time, i.e. the single thread case), then we know we can safely
> >> - * change the next pointer of the node; however, once the refcount is
> >> - * back above zero, then other threads could increase it (happens under
> >> - * heavy contention, when the refcount goes to zero in between a load
> >> - * and a refcount increment of a node in try_get, then back up to
> >> - * something non-zero, then the refcount increment is done by the other
> >> - * thread) -- so if the CAS to add the node to the actual list fails,
> >> - * decrese the refcount and leave the add operation to the next thread
> >> - * who puts the refcount back to zero (which could be us, hence the
> >> - * loop).
> >> - */
> >> - struct freelist_node *head = READ_ONCE(list->head);
> >> -
> >> - for (;;) {
> >> - WRITE_ONCE(node->next, head);
> >> - atomic_set_release(&node->refs, 1);
> >> -
> >> - if (!try_cmpxchg_release(&list->head, &head, node)) {
> >> - /*
> >> - * Hmm, the add failed, but we can only try again when
> >> - * the refcount goes back to zero.
> >> - */
> >> - if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
> >> - continue;
> >> + uint32_t i, hint = list->fh_used << list->fh_step;
> >> +
> >> + for (i = 0; i < list->fh_size; i++) {
> >> + struct freelist_node *item = NULL;
> >> + uint32_t slot = (i + hint) & list->fh_mask;
> >> + if (try_cmpxchg_release(&list->fh_ents[slot], &item, node)) {
> >> + list->fh_used++;
> >> + break;
> >> }
> >> - return;
> >> }
> >> +
> >> + return (i >= list->fh_size);
> >> }
> >>
> >> -static inline void freelist_add(struct freelist_node *node, struct freelist_head *list)
> >> +static inline int freelist_add(struct freelist_node *node, struct freelist_head *list)
> >> {
> >> - /*
> >> - * We know that the should-be-on-freelist bit is 0 at this point, so
> >> - * it's safe to set it using a fetch_add.
> >> - */
> >> - if (!atomic_fetch_add_release(REFS_ON_FREELIST, &node->refs)) {
> >> - /*
> >> - * Oh look! We were the last ones referencing this node, and we
> >> - * know we want to add it to the free list, so let's do it!
> >> - */
> >> - __freelist_add(node, list);
> >> - }
> >> + uint32_t hint = raw_smp_processor_id() << list->fh_step;
> >> + uint32_t slot = ((uint32_t) hint) & list->fh_mask;
> >> +
> >> + do {
> >> + struct freelist_node *item = NULL;
> >> + if (try_cmpxchg_release(&list->fh_ents[slot], &item, node))
> >> + return 0;
> >> + slot = (slot + 1) & list->fh_mask;
> >> + } while (1);
> >> +
> >> + return -1;
> >> }
> >>
> >> static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
> >> {
> >> - struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
> >> - unsigned int refs;
> >> -
> >> - while (head) {
> >> - prev = head;
> >> - refs = atomic_read(&head->refs);
> >> - if ((refs & REFS_MASK) == 0 ||
> >> - !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
> >> - head = smp_load_acquire(&list->head);
> >> - continue;
> >> + struct freelist_node *node = NULL;
> >> + uint32_t i, hint = raw_smp_processor_id() << list->fh_step;
> >> +
> >> + for (i = 0; i < list->fh_size; i++) {
> >> + uint32_t slot = (hint + i) & list->fh_mask;
> >> + struct freelist_node *item = smp_load_acquire(&list->fh_ents[slot]);
> >> + if (item && try_cmpxchg_release(&list->fh_ents[slot], &item, NULL)) {
> >> + node = item;
> >> + break;
> >> }
> >> + }
> >>
> >> - /*
> >> - * Good, reference count has been incremented (it wasn't at
> >> - * zero), which means we can read the next and not worry about
> >> - * it changing between now and the time we do the CAS.
> >> - */
> >> - next = READ_ONCE(head->next);
> >> - if (try_cmpxchg_acquire(&list->head, &head, next)) {
> >> - /*
> >> - * Yay, got the node. This means it was on the list,
> >> - * which means should-be-on-freelist must be false no
> >> - * matter the refcount (because nobody else knows it's
> >> - * been taken off yet, it can't have been put back on).
> >> - */
> >> - WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
> >> -
> >> - /*
> >> - * Decrease refcount twice, once for our ref, and once
> >> - * for the list's ref.
> >> - */
> >> - atomic_fetch_add(-2, &head->refs);
> >> -
> >> - return head;
> >> - }
> >> + return node;
> >> +}
> >>
> >> - /*
> >> - * OK, the head must have changed on us, but we still need to decrement
> >> - * the refcount we increased.
> >> - */
> >> - refs = atomic_fetch_add(-1, &prev->refs);
> >> - if (refs == REFS_ON_FREELIST + 1)
> >> - __freelist_add(prev, list);
> >> +static inline void freelist_destroy(struct freelist_head *list, void *context,
> >> + int (*release)(void *, void *))
> >> +{
> >> + uint32_t i;
> >> +
> >> + if (!list->fh_ents)
> >> + return;
> >> +
> >> + for (i = 0; i < list->fh_size; i++) {
> >> + uint32_t slot = i & list->fh_mask;
> >> + struct freelist_node *item = smp_load_acquire(&list->fh_ents[slot]);
> >> + while (item) {
> >> + if (try_cmpxchg_release(&list->fh_ents[slot], &item, NULL)) {
> >> + if (release)
> >> + release(context, item);
> >> + break;
> >> + }
> >> + }
> >> }
> >>
> >> - return NULL;
> >> + if (list->fh_ents) {
> >> + kfree(list->fh_ents);
> >> + list->fh_ents = NULL;
> >> + }
> >> }
> >> -
> >> #endif /* FREELIST_H */
> >> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> >> index 471b1d18a92f..5c41bee25983 100644
> >> --- a/kernel/kprobes.c
> >> +++ b/kernel/kprobes.c
> >> @@ -1277,20 +1277,21 @@ void kprobe_flush_task(struct task_struct *tk)
> >> }
> >> NOKPROBE_SYMBOL(kprobe_flush_task);
> >>
> >> -static inline void free_rp_inst(struct kretprobe *rp)
> >> +static int release_ri(void *context, void *node)
> >> {
> >> struct kretprobe_instance *ri;
> >> - struct freelist_node *node;
> >> - int count = 0;
> >> + ri = container_of(node, struct kretprobe_instance, freelist);
> >> + kfree(ri);
> >> + if (context)
> >> + (*((int *)context))++;
> >> + return 0;
> >> +}
> >>
> >> - node = rp->freelist.head;
> >> - while (node) {
> >> - ri = container_of(node, struct kretprobe_instance, freelist);
> >> - node = node->next;
> >> +static inline void free_rp_inst(struct kretprobe *rp)
> >> +{
> >> + int count = 0;
> >>
> >> - kfree(ri);
> >> - count++;
> >> - }
> >> + freelist_destroy(&rp->freelist, &count, release_ri);
> >>
> >> if (refcount_sub_and_test(count, &rp->rph->ref)) {
> >> kfree(rp->rph);
> >> @@ -2015,10 +2016,14 @@ int register_kretprobe(struct kretprobe *rp)
> >> rp->maxactive = num_possible_cpus();
> >> #endif
> >> }
> >> - rp->freelist.head = NULL;
> >> + if (freelist_init(&rp->freelist, rp->maxactive))
> >> + return -ENOMEM;
> >> +
> >> rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL);
> >> - if (!rp->rph)
> >> + if (!rp->rph) {
> >> + freelist_destroy(&rp->freelist, NULL, NULL);
> >> return -ENOMEM;
> >> + }
> >>
> >> rp->rph->rp = rp;
> >> for (i = 0; i < rp->maxactive; i++) {
> >> --
> >> 2.25.1

--
Masami Hiramatsu <mhiramat@xxxxxxxxxx>

Next message: Guenter Roeck: "[PATCH] riscv: Disable STACKPROTECTOR_PER_TASK if GCC_PLUGIN_RANDSTRUCT is enabled"
Previous message: Paul E. McKenney: "Re: [PATCH 2/2] rcu: Remove needless preemption disablement in rcu_all_qs()"
In reply to: Matt Wu: "Re: [PATCH] kretprobe scalability improvement"
Next in thread: Matt Wu: "Re: [PATCH] kretprobe scalability improvement"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]