Re: [PATCH v2 3/5] tracing/user_events: Add auto cleanup and a flag to persist events

From: Beau Belgrave
Date: Wed Jun 07 2023 - 15:17:04 EST


On Tue, Jun 06, 2023 at 06:26:56PM -0700, Alexei Starovoitov wrote:
> On Mon, Jun 5, 2023 at 4:39 PM Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> wrote:
> > +        /*
> > +         * When the event is not enabled for auto-delete there will always
> > +         * be at least 1 reference to the event. During the event creation
> > +         * we initially set the refcnt to 2 to achieve this. In those cases
> > +         * the caller must acquire event_mutex and after decrement check if
> > +         * the refcnt is 1, meaning this is the last reference. When auto
> > +         * delete is enabled, there will only be 1 ref, IE: refcnt will be
> > +         * only set to 1 during creation to allow the below checks to go
> > +         * through upon the last put. The last put must always be done with
> > +         * the event mutex held.
> > +         */
> > +        if (!locked) {
> > +                lockdep_assert_not_held(&event_mutex);
> > +                delete = refcount_dec_and_mutex_lock(&user->refcnt, &event_mutex);
> > +        } else {
> > +                lockdep_assert_held(&event_mutex);
> > +                delete = refcount_dec_and_test(&user->refcnt);
> > +        }
> > +
> > +        if (!delete)
> > +                return;
> > +
> > +        /* We now have the event_mutex in all cases */
> > +
> > +        if (user->reg_flags & USER_EVENT_REG_PERSIST) {
> > +                /* We should not get here when persist flag is set */
> > +                pr_alert("BUG: Auto-delete engaged on persistent event\n");
> > +                goto out;
> > +        }
> > +
> > +        /*
> > +         * Unfortunately we have to attempt the actual destroy in a work
> > +         * queue. This is because not all cases handle a trace_event_call
> > +         * being removed within the class->reg() operation for unregister.
> > +         */
> > +        INIT_WORK(&user->put_work, delayed_destroy_user_event);
> > +
> > +        /*
> > +         * Since the event is still in the hashtable, we have to re-inc
> > +         * the ref count to 1. This count will be decremented and checked
> > +         * in the work queue to ensure it's still the last ref. This is
> > +         * needed because a user-process could register the same event in
> > +         * between the time of event_mutex release and the work queue
> > +         * running the delayed destroy. If we removed the item now from
> > +         * the hashtable, this would result in a timing window where a
> > +         * user process would fail a register because the trace_event_call
> > +         * register would fail in the tracing layers.
> > +         */
> > +        refcount_set(&user->refcnt, 1);
>
> The refcnt-ing scheme is quite unorthodox.

Yes, unfortunately it's because we have to keep the event in the hashtable.

Typically we'd just remove the event from the hashtable, ref_dec, and
then free it upon the final ref_dec. The problem with that is that the
event in the hashtable is an actual trace_event exposed via
tracefs/perf, which prevents us from removing it at this time, as the
comment tries to explain.
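
For contrast, the usual pattern would look something like this (a rough
sketch, not the user_events code; the node field name is assumed):

        /* Usual lifetime pattern: unpublish first, free on the last put. */
        mutex_lock(&event_mutex);
        hash_del(&user->node);          /* no new lookups can find it */
        mutex_unlock(&event_mutex);

        if (refcount_dec_and_test(&user->refcnt))
                kfree(user);            /* nothing can reference it now */

We can't do that here, because unpublishing would also tear the
trace_event out from under tracefs/perf.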

> Atomically decrementing it to zero and then immediately set it back to 1?
> Smells racy.

Thanks for pointing this out, I likely need another comment in the code
explaining why this is ok.

It might smell that way :) But once it's zero, the only way to get
another reference is by acquiring event_mutex and then calling
find_user_event(). This is why it's important that at this phase we
hold the event_mutex in all cases.
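
To spell out the interleaving (a sketch; how the work is queued after
the refcount_set() is assumed here):

        /*
         * CPU A (last put)                  CPU B (concurrent register)
         * ----------------                  ---------------------------
         * refcount_dec_and_mutex_lock()
         *   refcnt -> 0, event_mutex held
         *                                   mutex_lock(&event_mutex) blocks
         * refcount_set(&user->refcnt, 1)
         * schedule_work(&user->put_work)
         * mutex_unlock(&event_mutex)
         *                                   find_user_event() sees refcnt >= 1
         */

So the refcnt is never observable at zero by anyone who could take a
new reference.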

> Another process can go through the same code and do another dec and set to 1
> and we'll have two work queued?

Once we set it to 1, we cannot get into this code path again until the
queued work has finished. The scheduled work acquires event_mutex and
checks that it's still the last reference before doing any actual
deletion; otherwise its refcount_dec leaves the event alive and allows
another work item to be queued via this path later.
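
Roughly, the worker does this (a sketch with details trimmed; the
destroy_user_event() helper and its error handling are assumed):

        static void delayed_destroy_user_event(struct work_struct *work)
        {
                struct user_event *user = container_of(work, struct user_event,
                                                       put_work);

                mutex_lock(&event_mutex);

                /*
                 * Drop the ref we re-set before queueing. If a register
                 * came in meanwhile, this is no longer the last ref: bail.
                 */
                if (!refcount_dec_and_test(&user->refcnt))
                        goto out;

                /* Still the last ref: safe to unregister and free (assumed helper). */
                destroy_user_event(user);
        out:
                mutex_unlock(&event_mutex);
        }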

> Will mutex_lock help somehow? If yes, then why atomic refcnt?

Hopefully I fully explained the mutex_lock above; if it's still
confusing, let me know. The reason for the atomic refcnt, though, is
that you initially get a reference to a user_event via find_user_event()
under the lock, but then each FD (and each enabler) holds a reference to
the underlying user_event (and thus the trace_event). Fork() and
enablement registers are done outside of this lock for performance
reasons. When the task exits, we also close down the enablers for that
mm/task, and the lock is not held there either (which is why the
less-used refcount_dec_and_mutex_lock is required on put when the lock
is not already acquired).
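
Concretely, the contexts look something like this (call sites sketched
and assumed, matching the locked parameter in the snippet above):

        /* Fork/enabler paths take refs without event_mutex: */
        refcount_inc(&user->refcnt);

        /* Put from a path already holding event_mutex: */
        user_event_put(user, true);

        /*
         * Put from task/mm exit, where the lock is not held; only the
         * final put takes event_mutex, via refcount_dec_and_mutex_lock():
         */
        user_event_put(user, false);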

Thanks,
-Beau