Re: [PATCH v4 1/4] perf/core: Rework forwarding of {task|cpu}-clock events

From: Pengfei Xu
Date: Tue Feb 20 2024 - 03:46:59 EST


Hi Ravi Bangoria,

On 2023-05-04 at 16:30:00 +0530, Ravi Bangoria wrote:
> Currently, PERF_TYPE_SOFTWARE is treated specially since task-clock and
> cpu-clock events are interfaced through it but internally gets forwarded
> to their own pmus.
>
> Rework this by overwriting event->attr.type in perf_swevent_init() which
> will cause perf_init_event() to retry with updated type and event will
> automatically get forwarded to right pmu. With the change, SW pmu no
> longer needs to be treated specially and can be included in 'pmu_idr'
> list.
>
> Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@xxxxxxx>
> ---
> include/linux/perf_event.h | 10 +++++
> kernel/events/core.c | 77 ++++++++++++++++++++------------------
> 2 files changed, 51 insertions(+), 36 deletions(-)

Greeting!
There is task hung in perf_tp_event_init in v6.8-rc4 in guest.

All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240219_055325_perf_tp_event_init
Syzkaller reproduced code: https://github.com/xupengfe/syzkaller_logs/blob/main/240219_055325_perf_tp_event_init/repro.c
Syzkaller reproduced syscall steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240219_055325_perf_tp_event_init/repro.prog
Kconfig(make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240219_055325_perf_tp_event_init/kconfig_origin
Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240219_055325_perf_tp_event_init/bisect_info.log
v6.8-rc4 issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240219_055325_perf_tp_event_init/841c35169323cd833294798e58b9bf63fa4fa1de_dmesg.log
v6.8-rc4 bzImage: https://github.com/xupengfe/syzkaller_logs/raw/main/240219_055325_perf_tp_event_init/bzimage_v6.8-rc4.tar.gz

Bisected and found bad commit:
0d6d062ca27e perf/core: Rework forwarding of {task|cpu}-clock events

It could be reproduced in 3600s.
After reverted the above commit on top of v6.8-rc4 kernel, this issue was gone.

Syzkaller repro.report:
"
clocksource: Long readout interval, skipping watchdog check: cs_nsec: 2247936849 wd_nsec: 2247937071
INFO: task syz-executor703:819 blocked for more than 143 seconds.
Not tainted 6.8.0-rc4-4e6c5b0f4ced+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor703 state:D stack:0 pid:819 tgid:819 ppid:760 flags:0x00000004
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5400 [inline]
__schedule+0xa7d/0x2810 kernel/sched/core.c:6727
__schedule_loop kernel/sched/core.c:6802 [inline]
schedule+0xd0/0x290 kernel/sched/core.c:6817
schedule_preempt_disabled+0x1c/0x30 kernel/sched/core.c:6874
__mutex_lock_common kernel/locking/mutex.c:684 [inline]
__mutex_lock+0xf63/0x1b50 kernel/locking/mutex.c:752
mutex_lock_nested+0x1f/0x30 kernel/locking/mutex.c:804
perf_trace_init+0x5c/0x310 kernel/trace/trace_event_perf.c:221
perf_tp_event_init+0xaf/0x130 kernel/events/core.c:10106
perf_try_init_event+0x13f/0x5a0 kernel/events/core.c:11672
perf_init_event kernel/events/core.c:11742 [inline]
perf_event_alloc kernel/events/core.c:12022 [inline]
perf_event_alloc+0x1069/0x3e40 kernel/events/core.c:11888
__do_sys_perf_event_open+0x48e/0x2c40 kernel/events/core.c:12529
__se_sys_perf_event_open kernel/events/core.c:12420 [inline]
__x64_sys_perf_event_open+0xc7/0x160 kernel/events/core.c:12420
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x74/0x150 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x45c36d
RSP: 002b:00007fffec7b5478 EFLAGS: 00000246 ORIG_RAX: 000000000000012a
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045c36d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000080
RBP: 00000000000b8995 R08: 0000000000000000 R09: 000000000000031a
R10: 00000000ffffffff R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000000000 R14: 00000000004f9018 R15: 0000000000000000
</TASK>
INFO: task syz-executor703:821 blocked for more than 143 seconds.
Not tainted 6.8.0-rc4-4e6c5b0f4ced+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor703 state:D stack:0 pid:821 tgid:821 ppid:767 flags:0x00000004
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5400 [inline]
__schedule+0xa7d/0x2810 kernel/sched/core.c:6727
__schedule_loop kernel/sched/core.c:6802 [inline]
schedule+0xd0/0x290 kernel/sched/core.c:6817
schedule_preempt_disabled+0x1c/0x30 kernel/sched/core.c:6874
__mutex_lock_common kernel/locking/mutex.c:684 [inline]
__mutex_lock+0xf63/0x1b50 kernel/locking/mutex.c:752
mutex_lock_nested+0x1f/0x30 kernel/locking/mutex.c:804
perf_trace_init+0x5c/0x310 kernel/trace/trace_event_perf.c:221
perf_tp_event_init+0xaf/0x130 kernel/events/core.c:10106
perf_try_init_event+0x13f/0x5a0 kernel/events/core.c:11672
perf_init_event kernel/events/core.c:11742 [inline]
perf_event_alloc kernel/events/core.c:12022 [inline]
perf_event_alloc+0x1069/0x3e40 kernel/events/core.c:11888
__do_sys_perf_event_open+0x48e/0x2c40 kernel/events/core.c:12529
__se_sys_perf_event_open kernel/events/core.c:12420 [inline]
__x64_sys_perf_event_open+0xc7/0x160 kernel/events/core.c:12420
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x74/0x150 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x45c36d
RSP: 002b:00007fffec7b5478 EFLAGS: 00000246 ORIG_RAX: 000000000000012a
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045c36d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000080
RBP: 00000000000b8bea R08: 0000000000000000 R09: 000000000000031e
R10: 00000000ffffffff R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000000000 R14: 00000000004f9018 R15: 0000000000000000
"

Thanks!

---

If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
// start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
// You could change the bzImage_xxx as you want
// Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage //x should equal or less than cpu num your pc has

Fill the bzImage file into above start3.sh to load the target kernel in vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
./configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install

Best Regards,
Thanks!

>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d5628a7b5eaa..5c8a748f51ac 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -295,6 +295,8 @@ struct perf_event_pmu_context;
>
> struct perf_output_handle;
>
> +#define PMU_NULL_DEV ((void *)(~0))
> +
> /**
> * struct pmu - generic performance monitoring unit
> */
> @@ -827,6 +829,14 @@ struct perf_event {
> void *security;
> #endif
> struct list_head sb_list;
> +
> + /*
> + * Certain events gets forwarded to another pmu internally by over-
> + * writing kernel copy of event->attr.type without user being aware
> + * of it. event->orig_type contains original 'type' requested by
> + * user.
> + */
> + __u32 orig_type;
> #endif /* CONFIG_PERF_EVENTS */
> };
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 435815d3be3f..0695bb9fbbb6 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6647,7 +6647,7 @@ static void perf_sigtrap(struct perf_event *event)
> return;
>
> send_sig_perf((void __user *)event->pending_addr,
> - event->attr.type, event->attr.sig_data);
> + event->orig_type, event->attr.sig_data);
> }
>
> /*
> @@ -9951,6 +9951,9 @@ static void sw_perf_event_destroy(struct perf_event *event)
> swevent_hlist_put();
> }
>
> +static struct pmu perf_cpu_clock; /* fwd declaration */
> +static struct pmu perf_task_clock;
> +
> static int perf_swevent_init(struct perf_event *event)
> {
> u64 event_id = event->attr.config;
> @@ -9966,7 +9969,10 @@ static int perf_swevent_init(struct perf_event *event)
>
> switch (event_id) {
> case PERF_COUNT_SW_CPU_CLOCK:
> + event->attr.type = perf_cpu_clock.type;
> + return -ENOENT;
> case PERF_COUNT_SW_TASK_CLOCK:
> + event->attr.type = perf_task_clock.type;
> return -ENOENT;
>
> default:
> @@ -11086,7 +11092,7 @@ static void cpu_clock_event_read(struct perf_event *event)
>
> static int cpu_clock_event_init(struct perf_event *event)
> {
> - if (event->attr.type != PERF_TYPE_SOFTWARE)
> + if (event->attr.type != perf_cpu_clock.type)
> return -ENOENT;
>
> if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
> @@ -11107,6 +11113,7 @@ static struct pmu perf_cpu_clock = {
> .task_ctx_nr = perf_sw_context,
>
> .capabilities = PERF_PMU_CAP_NO_NMI,
> + .dev = PMU_NULL_DEV,
>
> .event_init = cpu_clock_event_init,
> .add = cpu_clock_event_add,
> @@ -11167,7 +11174,7 @@ static void task_clock_event_read(struct perf_event *event)
>
> static int task_clock_event_init(struct perf_event *event)
> {
> - if (event->attr.type != PERF_TYPE_SOFTWARE)
> + if (event->attr.type != perf_task_clock.type)
> return -ENOENT;
>
> if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
> @@ -11188,6 +11195,7 @@ static struct pmu perf_task_clock = {
> .task_ctx_nr = perf_sw_context,
>
> .capabilities = PERF_PMU_CAP_NO_NMI,
> + .dev = PMU_NULL_DEV,
>
> .event_init = task_clock_event_init,
> .add = task_clock_event_add,
> @@ -11415,31 +11423,31 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> goto unlock;
>
> pmu->type = -1;
> - if (!name)
> - goto skip_type;
> + if (WARN_ONCE(!name, "Can not register anonymous pmu.\n")) {
> + ret = -EINVAL;
> + goto free_pdc;
> + }
> +
> pmu->name = name;
>
> - if (type != PERF_TYPE_SOFTWARE) {
> - if (type >= 0)
> - max = type;
> + if (type >= 0)
> + max = type;
>
> - ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
> - if (ret < 0)
> - goto free_pdc;
> + ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
> + if (ret < 0)
> + goto free_pdc;
>
> - WARN_ON(type >= 0 && ret != type);
> + WARN_ON(type >= 0 && ret != type);
>
> - type = ret;
> - }
> + type = ret;
> pmu->type = type;
>
> - if (pmu_bus_running) {
> + if (pmu_bus_running && !pmu->dev) {
> ret = pmu_dev_alloc(pmu);
> if (ret)
> goto free_idr;
> }
>
> -skip_type:
> ret = -ENOMEM;
> pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context);
> if (!pmu->cpu_pmu_context)
> @@ -11481,16 +11489,7 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> if (!pmu->event_idx)
> pmu->event_idx = perf_event_idx_default;
>
> - /*
> - * Ensure the TYPE_SOFTWARE PMUs are at the head of the list,
> - * since these cannot be in the IDR. This way the linear search
> - * is fast, provided a valid software event is provided.
> - */
> - if (type == PERF_TYPE_SOFTWARE || !name)
> - list_add_rcu(&pmu->entry, &pmus);
> - else
> - list_add_tail_rcu(&pmu->entry, &pmus);
> -
> + list_add_rcu(&pmu->entry, &pmus);
> atomic_set(&pmu->exclusive_cnt, 0);
> ret = 0;
> unlock:
> @@ -11499,12 +11498,13 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> return ret;
>
> free_dev:
> - device_del(pmu->dev);
> - put_device(pmu->dev);
> + if (pmu->dev && pmu->dev != PMU_NULL_DEV) {
> + device_del(pmu->dev);
> + put_device(pmu->dev);
> + }
>
> free_idr:
> - if (pmu->type != PERF_TYPE_SOFTWARE)
> - idr_remove(&pmu_idr, pmu->type);
> + idr_remove(&pmu_idr, pmu->type);
>
> free_pdc:
> free_percpu(pmu->pmu_disable_count);
> @@ -11525,9 +11525,8 @@ void perf_pmu_unregister(struct pmu *pmu)
> synchronize_rcu();
>
> free_percpu(pmu->pmu_disable_count);
> - if (pmu->type != PERF_TYPE_SOFTWARE)
> - idr_remove(&pmu_idr, pmu->type);
> - if (pmu_bus_running) {
> + idr_remove(&pmu_idr, pmu->type);
> + if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
> if (pmu->nr_addr_filters)
> device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
> device_del(pmu->dev);
> @@ -11601,6 +11600,12 @@ static struct pmu *perf_init_event(struct perf_event *event)
>
> idx = srcu_read_lock(&pmus_srcu);
>
> + /*
> + * Save original type before calling pmu->event_init() since certain
> + * pmus overwrites event->attr.type to forward event to another pmu.
> + */
> + event->orig_type = event->attr.type;
> +
> /* Try parent's PMU first: */
> if (event->parent && event->parent->pmu) {
> pmu = event->parent->pmu;
> @@ -13640,8 +13645,8 @@ void __init perf_event_init(void)
> perf_event_init_all_cpus();
> init_srcu_struct(&pmus_srcu);
> perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
> - perf_pmu_register(&perf_cpu_clock, NULL, -1);
> - perf_pmu_register(&perf_task_clock, NULL, -1);
> + perf_pmu_register(&perf_cpu_clock, "cpu_clock", -1);
> + perf_pmu_register(&perf_task_clock, "task_clock", -1);
> perf_tp_register();
> perf_event_init_cpu(smp_processor_id());
> register_reboot_notifier(&perf_reboot_notifier);
> @@ -13684,7 +13689,7 @@ static int __init perf_event_sysfs_init(void)
> goto unlock;
>
> list_for_each_entry(pmu, &pmus, entry) {
> - if (!pmu->name || pmu->type < 0)
> + if (pmu->dev)
> continue;
>
> ret = pmu_dev_alloc(pmu);
> --
> 2.40.0
>