Re: [PATCH] user_events: Enable user processes to create and write to trace events

From: Masami Hiramatsu
Date: Thu Oct 07 2021 - 19:12:55 EST


On Thu, 7 Oct 2021 09:22:04 -0700
Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> wrote:

> On Thu, Oct 07, 2021 at 11:17:38PM +0900, Masami Hiramatsu wrote:
> > > > > Psuedo code example of typical usage:
> > > > > page_fd = open("user_events_mmap", O_RDWR);
> > > > > page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0);
> > > > >
> > > > > data_fd = open("user_events_data", O_RDWR);
> > > > > data_id = ioctl(data_fd, DIAG_IOCSREG, "test");
> > > >
> > > > Hmm, if the data_id is same as the ID in events/*/*/format, it will
> > > > be queried by libtraceevent or libftrace, isn't it?
> > > > And also, if you can define the user-event via dynamic_event interface,
> > > > it should be used instead of using ioctl on data channel.
> > > >
> > > These will not be the same ID. The data_id will be the offset within a
> > > page to check for status. We don't want to give user mode processes
> > > access to the entire tracefs, so we are isolating to a few files that
> > > will get wider access.
> >
> > OK, but how to query the id of the event which has been made
> > from writing dynamic_events file? (maybe query from name by
> > ioctl?)
> >
> Yes, you can either do ioctl or with the v2 patch you can now just cat
> user_events_status as you indicated previously to do :)

OK, let me check the v2.

>
> > > We still require the ioctl to register and cannot bet solely on
> > > dynamic_events. We want a fast way to register events on process startup
> > > and get the data_id back as fast as possible.
> >
> > What about using ioctl on 'user_event_status' file?
> > But in this case, the name is not showing the function.
> >
> > Then, what about renaming it as simply 'user_events'? :)
> > This file will show the current user events as similar to synth_events,
> > in addition, it will show the current status in "# comment".
> >
> > $ echo "user1 u32 arg1; u64 arg2" > user_events
> > $ cat user_events
> > user1 u32 arg1; u64 arg2
> > $ echo 1 > events/user_events/user1/enable
> > $ cat user_events
> > user1 u32 arg1; u64 arg2 # Used by ftrace
> >
> > In addition,
> > - user-process can do ioctl() to query offset instead of id from name.
> > - user-process can do mmap() the file to monitor the status.
> >
> Please see v2 patch, I do this pattern except it's '(Used by ftrace)'
> instead of '# Used by ftrace'.
>
> Format is id:name status

Hm, why I suggested to use "# status" is that the comment will be
removed when writing it. So you can do

cat user_events > ~/saved_events
(reboot)
cat ~/saved_events > user_events

to restore events :)

>
> > > The other thing is we need ref counting to know if the event is busy.
> > > Having the ID in the packet avoids having a fd per-event, but it also
> > > makes ref counting process lifetime of each event quite hard.
> >
> > Hmm, I don't think so. You can use an array of the pointer to
> > events on the private data of the struct file.
> > When you add (or start using) an event (this is identified by ioctl),
> > you can increment the event refcount and add it to the array.
> > When the file is closed (in exiting process), it will loop on the
> > array and decrement the refcount for each event.
> > Then, after all tracers disabled the event, ftrace can remove the
> > event in background (unless it is defined through 'dynamic_events' or
> > 'user_events').
> >
> Yes, I didn't say it's impossible :) It's quite hard and takes a lot
> more management. I don't see a clear benefit to that approach, why is it
> better than an fd lifetime? Not trying to be difficult, just trying to
> be pragmatic about what approach is best.

I'm not sure this point, you mean 1 fd == 1 event model?

> > > We want to avoid the kernel component touching any of the user data. For
> > > example eBPF programs will parse the data directly from definitions
> > > provided by the user. If the kernel side changed what the user expected
> > > from their side the eBPF program would fail to decode.
> >
> > Sorry, that's not doable unless we introduce a new __rel_loc__ attribute
> > to the ftrace (and libtraceevent). With that, the dynamic data
> > (like string) offset can be written as the offset from itself.
> >
> With this patch I have no problem getting hist, trigger and filter to
> work since the user is providing all correct offsets. Can you help me
> understand why this won't work? (It appears to work well without the
> kernel modifying user provided memory).

OK, let me check it.

>
> > > > [1234][0xdeadbeef][12][0]["Hello world"]
> >
> > Anyway, as far as you are using trace event the data format itself is
> > the contract between kernel and user, because *kernel* also has to
> > decode it (e.g. user will read the 'trace' file.) and other tools
> > like perf has to decode it too.
> >
> > The kernel will not modify the data (again with __rel_loc__ attr) but
> > it will add the common_* field and event ID as a header to its buffer
> > when record it.
> >
> Adding the common fields is fine, and works well with the current patch.
> I've been able to decode in ftrace, perf and eBPF without any issues for
> common types. __data_loc needs user mode cooperation, but no other type
> does. I've also tested hist, filter and trigger and all appear to work
> as I understand them to.

__data_loc has the data offset field, which is the offset from the
entry of the recorded event on the ring buffer. Since there are
common fields, the offset is not the offset from user given buffer.
So it must be changed by kernel.
(Note that common fields expected to exist in all events, so that
user scripts can process the fields commonly)

>
> > > We also want
> > > predicate filtering to work as cheap as possible. I would really like to
> > > keep offset manipulation entirely in the user space to avoid confusion
> > > across the various tracing mechanisms and avoid probing the user data
> > > upon each call (eBPF programs only selectively probe in data).
> >
> > OK, so let's add __rel_loc__ attribute. The rel_loc type will be
> >
> > struct rel_loc {
> > uint16_t len; /* The data size (including '\0' if string )*/
> > uint16_t offs; /* The offset of actual data from this field */
> > } __packed;
> >
> > Hmm, btw, this will be good for probe events... I don't need to pass
> > the base address with this attribute.
> >
> What's the difference between __rel_loc__ and __data_loc? Seems like
> instead of just offset it's length + offset?

In my idea, rel_loc is similar to the data_loc. It has the offset, but
the offset is the data offset from the rel_loc, not from the entry of
the recorded data. So kernel doesn't need to adjust it.

Thank you,

--
Masami Hiramatsu <mhiramat@xxxxxxxxxx>