Re: [PATCH v2 1/3] tracing/user_events: Fix incorrect return value for writing operation when events are disabled

From: Beau Belgrave
Date: Mon Jun 19 2023 - 14:41:22 EST


On Mon, Jun 19, 2023 at 04:51:56PM +0800, sunliming wrote:
> Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> 于2023年6月17日周六 00:08写道:
> >
> > On Fri, Jun 09, 2023 at 11:03:00AM +0800, sunliming wrote:
> > > The writing operation return the count of writes whether events are
> > > enabled or disabled. This is incorrect when events are disabled. Fix
> > > this by just return -ENOENT when events are disabled.
> > >
> >
> > When testing this patch locally I found that we would occasionally get
> > -ENOENT when events were enabled, but then become disabled, since writes
> > do not have any locking around the tracepoint checks for performance
> > reasons.
> >
> > I've asked a few peers of mine their thoughts on this, whether an error
> > should result when there are no enabled events. The consensus I've heard
> > back is that they would not consider this case an actual error, just as
> > writing to /dev/null does not actually return an error.
> >
> > However, if you feel strongly we need this and have a good use case, it
> > seems better to enable this logic behind a flag instead of having it
> > default based on my conversations with others.
> >
> > Thanks,
> > -Beau
>
>
>
> There is indeed a problem. Once enabled, perform the write operation
> immediately.
>

The immediate write does work, and gets put into a buffer. The ftrace
and perf self tests do the above case. So, no worries at this point.

> Now,when the event is disabled, the trace record appears to be lost.

I'm taking this to mean, if in between the time of the bit check and the
actual write() /writev() syscall the event becomes disabled, the event
won't write to the buffer. Yes, that is expected.

> In some situations
> where data timing is sensitive, it may cause confusion. In this case,
> not returning an
> error (as mentioned in your reply, it is not considered this case an
> actual error) and
> returning 0 ( meaning that the number of data to be written is 0) may
> be a good way
> to handle it?

This is where I get a little lost. What would a user process do with a
return of 0 bytes? It shouldn't retry, since it just hit that small
timing window. In reality, it just incurred a temporary excessive
syscall cost, but no real data loss (the operator/admin turned the event
off).

I'm missing why you feel it's important the user process know such a
window was hit?

Can you help me understand that?

I do think returning 0 bytes is better than an error here, but I'd
really like to know why the user process wants to know at all. Maybe
they have user-space only logging and want to be able to mark there if
it's in both spots (kernel and user buffers)?

Thanks,
-Beau

> Thanks,
> -Sunliming
>
> >
> > > Signed-off-by: sunliming <sunliming@xxxxxxxxxx>
> > > ---
> > > kernel/trace/trace_events_user.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
> > > index 1ac5ba5685ed..92204bbe79da 100644
> > > --- a/kernel/trace/trace_events_user.c
> > > +++ b/kernel/trace/trace_events_user.c
> > > @@ -1957,7 +1957,8 @@ static ssize_t user_events_write_core(struct file *file, struct iov_iter *i)
> > >
> > > if (unlikely(faulted))
> > > return -EFAULT;
> > > - }
> > > + } else
> > > + return -ENOENT;
> > >
> > > return ret;
> > > }
> > > --
> > > 2.25.1