Re: [PATCH] tracing/user_events: Run BPF program if attached

From: Steven Rostedt
Date: Tue May 09 2023 - 16:30:59 EST


On Tue, 9 May 2023 13:01:11 -0400
Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:

> > I see no practical use case for bpf progs to be connected to user events.
>
> That's not a technical reason. Obviously they have a use case.

Alexei,

It was great having a chat with you during lunch at LSFMM/BPF!

Looking forward to your technical responses, which I believe are
legitimate requests. I'm replying here because, during our conversation,
you had the misperception that user events make a system call even when
the event is disabled. I told you I would point out the code that shows
that the kernel sets the bit, and that user space does not make a system
call when the event is disabled.

From the user space side, which does:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/user_events/example.c#n60

/* Check if anyone is listening */
if (enabled) {
	/* Yep, trace out our data */
	writev(data_fd, (const struct iovec *)io, 2);

	/* Increase the count */
	count++;

	printf("Something was attached, wrote data\n");
}


Where it told the kernel about that "enabled" variable:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/user_events/example.c#n47

if (event_reg(data_fd, "test u32 count", &write, &enabled) == -1)
	return errno;

static int event_reg(int fd, const char *command, int *write, int *enabled)
{
	struct user_reg reg = {0};

	reg.size = sizeof(reg);
	reg.enable_bit = 31;
	reg.enable_size = sizeof(*enabled);
	reg.enable_addr = (__u64)enabled;
	reg.name_args = (__u64)command;

	if (ioctl(fd, DIAG_IOCSREG, &reg) == -1)
		return -1;

	*write = reg.write_index;

	return 0;
}

The above will add a trace event into tracefs. When someone does:

# echo 1 > /sys/kernel/tracing/events/user_events/test/enable

The kernel will trigger the class->reg function, defined by:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n1804

user->class.reg = user_event_reg;

Which calls:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n1555

update_enable_bit_for(user);

Which does:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n1465

update_enable_bit_for() {
	[..]
	user_event_enabler_update(user);


https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n451

user_event_enabler_update() {
	[..]
	user_event_enabler_write(mm, enabler, true, &attempt);

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/trace_events_user.c#n385

static int user_event_enabler_write(struct user_event_mm *mm,
				    struct user_event_enabler *enabler,
				    bool fixup_fault, int *attempt)
{
	unsigned long uaddr = enabler->addr;
	unsigned long *ptr;
	struct page *page;
	void *kaddr;
	int ret;

	lockdep_assert_held(&event_mutex);
	mmap_assert_locked(mm->mm);

	*attempt += 1;

	/* Ensure MM has tasks, cannot use after exit_mm() */
	if (refcount_read(&mm->tasks) == 0)
		return -ENOENT;

	if (unlikely(test_bit(ENABLE_VAL_FAULTING_BIT, ENABLE_BITOPS(enabler)) ||
		     test_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler))))
		return -EBUSY;

	ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
				    &page, NULL, NULL);

	if (unlikely(ret <= 0)) {
		if (!fixup_fault)
			return -EFAULT;

		if (!user_event_enabler_queue_fault(mm, enabler, *attempt))
			pr_warn("user_events: Unable to queue fault handler\n");

		return -EFAULT;
	}

	kaddr = kmap_local_page(page);
	ptr = kaddr + (uaddr & ~PAGE_MASK);

	/* Update bit atomically, user tracers must be atomic as well */
	if (enabler->event && enabler->event->status)
		set_bit(enabler->values & ENABLE_VAL_BIT_MASK, ptr);
	else
		clear_bit(enabler->values & ENABLE_VAL_BIT_MASK, ptr);

	kunmap_local(kaddr);
	unpin_user_pages_dirty_lock(&page, 1, true);

	return 0;
}

The above pins and maps the page that holds the user space address, and
then sets the bit that was registered (bit 31 in the example).

That is, it makes "enabled" non-zero, and the if statement:

if (enabled) {
	/* Yep, trace out our data */
	writev(data_fd, (const struct iovec *)io, 2);

	/* Increase the count */
	count++;

	printf("Something was attached, wrote data\n");
}

Is now executed.
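
To make the handshake concrete, here is a minimal user space model of
the two sides. It is only a sketch and not kernel code: the model_*()
names are hypothetical, C11 atomics stand in for the kernel's set_bit()
and clear_bit(), and a counter stands in for the writev() call. It
assumes a 32-bit enable word with bit 31 registered, as in example.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the set_bit()/clear_bit() step at the end of
 * user_event_enabler_write(). Atomically updates one bit of the word.
 */
static void model_kernel_update(_Atomic uint32_t *word, unsigned int bit,
				bool status)
{
	uint32_t mask;

	assert(bit < 32);
	mask = UINT32_C(1) << bit;

	if (status)
		atomic_fetch_or(word, mask);
	else
		atomic_fetch_and(word, ~mask);
}

/*
 * Hypothetical stand-in for the "if (enabled)" block in example.c.
 * *writes counts the calls that would have reached writev().
 * Returns 1 if the event was "traced", 0 if it was skipped.
 */
static int model_trace_once(_Atomic uint32_t *word, int *writes)
{
	/* Only a memory load happens here, there is no system call */
	if (!atomic_load(word))
		return 0;

	(*writes)++;	/* stands in for writev(data_fd, io, 2) */
	return 1;
}

/*
 * Offset of an address within its page, the same arithmetic as
 * (uaddr & ~PAGE_MASK). Assumes page_size is a power of two.
 */
static uintptr_t model_page_offset(uintptr_t uaddr, uintptr_t page_size)
{
	return uaddr & (page_size - 1);
}
```

With the word at zero, model_trace_once() returns without touching the
counter, which models the disabled case. After
model_kernel_update(&word, 31, true) the word is non-zero and the
counter goes up on the next call.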

-- Steve