Re: [PATCH] user_events: Enable user processes to create and write to trace events

From: Steven Rostedt
Date: Wed Oct 06 2021 - 12:54:45 EST


On Tue, 5 Oct 2021 15:44:28 -0700
Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> wrote:

> User mode processes that wish to use trace events to get data into
> ftrace, perf, eBPF, etc are limited to uprobes today. The user events
> features enables an ABI for user mode processes to create and write to
> trace events that are isolated from kernel level trace events. This
> enables a faster path for tracing from user mode data as well as opens
> managed code to participate in trace events, where stub locations are
> dynamic.
>
> User processes often want to trace only when it's useful. To enable this
> a set of pages are mapped into the user process space that indicate the
> current state of the user events that have been registered. User
> processes can check if their event is hooked to a trace/probe, and if it
> is, emit the event data out via the write() syscall.
>
> Two files are added to tracefs to accomplish this:
> user_events_mmap - This file is mmap'd into participating user mode
> processes to indicate event status.
>
> user_events_data - This file is opened and register/delete ioctl's are
> issued to create/open/delete trace events that can be used for tracing.
>
> The typical scenario is on process start to mmap user_events_mmap. Processes
> then register the events they plan to use via the REG ioctl. The return value
> of the ioctl indicates the byte in the mmap to use for status. The file that
> was used for the ioctl is now accepting data via write() to emit out into the
> trace event.
>
> Psuedo code example of typical usage:
> page_fd = open("user_events_mmap", O_RDWR);
> page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0);
>
> data_fd = open("user_events_data", O_RDWR);
> data_id = ioctl(data_fd, DIAG_IOCSREG, "test");
>
> if (page_data[data_id]) write(data_fd, &payload, sizeof(payload));

What is the type of "page_data". I'd like to test it before accepting it.

>From playing around, I see that page_data is of type char *.


>
> User events are also exposed via the dynamic_events tracefs file for
> both create, delete and current status.
>
> Simple example to register a user event via dynamic_events and get status:
> echo ue:test >> dynamic_events
> cat dynamic_events
> ue:test
>
> If an event is hooked to a probe, the probe hooked shows up:
> echo 1 > events/user_events/test/enable
> cat dynamic_events
> ue:test (Used by ftrace)
>
> Users can describe the trace event format via the following format:
> name[;field1;field2]
>
> Each field has the following format:
> type\tname\tsize\toffset

BTW, the format should follow the way other probes are created. That is,
having a space between the name and field, and not a semicolon.

>
> Example for char array with a size of 20 named msg:
> echo -e 'ue:detailed;char[]\tmsg\t20\t0' >> dynamic_events
> cat dynamic_events
> ue:detailed;char[] msg
>
> Data offsets are based on the data written out via write() and will be
> updated to reflect the correct offset in the trace_event fields. __data_loc
> types must be aware of the size of trace_entry/common properties to ensure
> proper decoding. An ioctl is provided that enables user mode processes that
> only have access to user_events_data that returns the correct offset to use
> within the data payload (nothing needs to be done on registration).
>
> The above format is valid for both the ioctl and the dynamic_events file.
>
> Signed-off-by: Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx>
> ---
> kernel/trace/Kconfig | 15 +
> kernel/trace/Makefile | 1 +
> kernel/trace/trace_events_user.c | 845 +++++++++++++++++++++++++++++++
> 3 files changed, 861 insertions(+)
> create mode 100644 kernel/trace/trace_events_user.c
>
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 3ee23f4d437f..deaaad421be4 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -725,6 +725,21 @@ config SYNTH_EVENTS
>
> If in doubt, say N.
>
> +config USER_EVENTS
> + bool "User trace events"
> + select TRACING
> + select DYNAMIC_EVENTS
> + default n
> + help
> + User trace events are user-defined trace events that
> + can be used like an existing kernel trace event. User trace
> + events are generated by writing to a tracefs file. User
> + processes can determine if their tracing events should be
> + generated by memory mapping a tracefs file and checking for
> + an associated byte being non-zero.
> +
> + If in doubt, say N.
> +
> config HIST_TRIGGERS
> bool "Histogram triggers"
> depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index b1c47ccf4f73..a653b255e89c 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -80,6 +80,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
> obj-$(CONFIG_TRACE_EVENT_INJECT) += trace_events_inject.o
> obj-$(CONFIG_SYNTH_EVENTS) += trace_events_synth.o
> obj-$(CONFIG_HIST_TRIGGERS) += trace_events_hist.o
> +obj-$(CONFIG_USER_EVENTS) += trace_events_user.o
> obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o
> obj-$(CONFIG_KPROBE_EVENTS) += trace_kprobe.o
> obj-$(CONFIG_TRACEPOINTS) += error_report-traces.o
> diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
> new file mode 100644
> index 000000000000..9afa72b55fa8
> --- /dev/null
> +++ b/kernel/trace/trace_events_user.c
> @@ -0,0 +1,845 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2021, Microsoft Corporation.
> + *
> + * Authors:
> + * Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx>
> + */
> +
> +#include <linux/bitmap.h>
> +#include <linux/cdev.h>
> +#include <linux/hashtable.h>
> +#include <linux/io.h>
> +#include <linux/ioctl.h>
> +#include <linux/jhash.h>
> +#include <linux/trace_events.h>
> +#include <linux/tracefs.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include "trace.h"
> +#include "trace_dynevent.h"
> +
> +#define USER_EVENTS_SYSTEM "user_events"
> +#define USER_EVENTS_PREFIX "ue:"
> +#define USER_EVENTS_PREFIX_LEN (sizeof(USER_EVENTS_PREFIX)-1)
> +
> +/* Bits 0-6 are for known probe types, Bit 7 is for unknown probes */
> +#define EVENT_BIT_FTRACE 0
> +#define EVENT_BIT_PERF 1
> +#define EVENT_BIT_OTHER 7
> +
> +#define EVENT_STATUS_FTRACE (1 << EVENT_BIT_FTRACE)
> +#define EVENT_STATUS_PERF (1 << EVENT_BIT_PERF)
> +#define EVENT_STATUS_OTHER (1 << EVENT_BIT_OTHER)
> +
> +#define FIELD_DEPTH_TYPE 0
> +#define FIELD_DEPTH_NAME 1
> +#define FIELD_DEPTH_SIZE 2
> +#define FIELD_DEPTH_OFFSET 3
> +
> +/*
> + * Limits how many trace_event calls user processes can create:
> + * Must be multiple of PAGE_SIZE.
> + */
> +#define MAX_PAGES 1
> +#define MAX_EVENTS (MAX_PAGES * PAGE_SIZE)
> +
> +/* Limit how long of an event name plus args within the subsystem. */
> +#define MAX_EVENT_DESC 512
> +#define EVENT_NAME(user_event) ((user_event)->tracepoint.name)
> +
> +#define DIAG_IOC_MAGIC '*'
> +#define DIAG_IOCSREG _IOW(DIAG_IOC_MAGIC, 0, char*)
> +#define DIAG_IOCSDEL _IOW(DIAG_IOC_MAGIC, 1, char*)
> +#define DIAG_IOCQLOCOFFSET _IO(DIAG_IOC_MAGIC, 2)

These obviously will need to go into a user abi header file.

> +
> +static char *register_page_data;
> +
> +static DEFINE_HASHTABLE(register_table, 4);
> +static DECLARE_BITMAP(page_bitmap, MAX_EVENTS);
> +
> +struct user_event {
> + struct tracepoint tracepoint;
> + struct trace_event_call call;
> + struct trace_event_class class;
> + struct dyn_event devent;
> + struct hlist_node node;
> + atomic_t refs;
> + int index;
> + char *args;
> +};
> +
> +#ifdef CONFIG_PERF_EVENTS
> +struct user_bpf_context {
> + int udatalen;
> + const char __user *udata;
> +};
> +#endif
> +
> +typedef void (*user_event_func_t) (struct user_event *user,
> + const char __user *udata,
> + size_t udatalen, void *tpdata);
> +
> +static int register_user_event(char *name, char *args,
> + struct user_event **newuser);
> +

[..]

> +static int __init trace_events_user_init(void)
> +{
> + int ret;
> +
> + /* Zero all bits beside 0 (which is reserved for failures) */
> + bitmap_zero(page_bitmap, MAX_EVENTS);
> + set_bit(0, page_bitmap);
> +
> + register_page_data = kmalloc(MAX_EVENTS, GFP_KERNEL);

You want "kzalloc" here. Because when I read the map without adding
anything, I get:

printf("%lx\n", *(unsigned long *)page_data);

Produces:

ffffffff9065004e

But if I convert it to kzalloc() it gives me:

0

Thus, you are exposing stale memory. If you want to expose this to
non-admin users, this is a major security leak.

-- Steve

> +
> + if (!register_page_data)
> + return -ENOMEM;
> +
> + set_page_reservations(true);
> +
> + ret = create_user_tracefs();
> +
> + if (ret) {
> + pr_warn("user_events could not register with tracefs\n");
> + set_page_reservations(false);
> + kfree(register_page_data);
> + return ret;
> + }
> +
> + if (dyn_event_register(&user_event_dops))
> + pr_warn("user_events could not register with dyn_events\n");
> +
> + return 0;
> +}
> +
> +fs_initcall(trace_events_user_init);