Re: Posix process cpu timer inaccuracies

From: Thomas Gleixner
Date: Fri Mar 01 2024 - 16:27:36 EST


Delyan!

On Mon, Feb 26 2024 at 16:29, Delyan Kratunov wrote:
>> I don't know and those assumptions have been clearly wrong at the point
>> where the tool was written.
>
> That was my impression as well, thanks for confirming. (I've found at least 3
> tools with this same incorrect belief)

The wonders of error proliferation by mindless copy & pasta and/or design
borrowing.

> Absolutely, the ability to write a profiler with perf_event_open is not in
> question at all. However, not every situation allows for PMU or
> perf_event_open access. Timers could form a nice middle ground, in exactly the
> way people have tried to use them.
>
> I'd like to push back a little on the "CLOCK_THREAD_CPUTIME_ID fixes things"
> point, though. From an application and library point of view, the per-thread
> clocks are harder to use - you need to either orchestrate every thread to
> participate voluntarily or poll the thread ids and create timers from another
> thread. In perf_event_open, this is solved via the .inherit/.inherit_thread
> bits.

I did not say it's easy and fixes all problems magically :)

As accessing a different thread/process requires ptrace permissions, this
might be solvable via ptrace, though that might turn out to be too
heavyweight.

Though it would certainly be possible to implement inheritance for those
timers and let the kernel set them up for all existing and future threads.

That's a bit tricky versus accounting on behalf of, and association to,
the profiler thread in the (v)fork() case, and it also needs some thought
about how the profiler thread gets informed of the newly associated
timer_id, but I think it's doable.

> More importantly, they don't work for all workloads. If I have 10 threads that
> each run for 5ms, a 10ms process timer would fire 5 times, while per-thread
> 10ms timers would never fire. You can easily imagine an application that
> accrues all its cpu time in a way that doesn't generate a single signal (in
> the extreme, threads only living a single tick).

That's true, but you have to look at the lifetime rules of those
timers.

A CLOCK_THREAD_CPUTIME_ID timer is owned by the thread which creates it,
no matter which thread is the monitored target. When the monitored
thread exits, the timer is disarmed, but the timer itself stays
accessible to its owner, which can still query it.

As of today a timer_gettime() on a CLOCK_THREAD_CPUTIME_ID timer after
the monitored thread exited returns { .it_value = 0, .it_interval = 0 }.

We can't change that in general, but if we go and implement the
inheritance mode, then the timer would be owned by the profiler thread.
Even without inheritance mode we could handle a special flag for
timer_create() to denote that this is a magic timer :)

So that magic flag would preserve the accumulated runtime in the timer
in some way when the thread exits, and either return it via
timer_gettime() along with some magic to denote that the monitored
thread is gone, or add a new timer_get_foo() syscall for it.

Whether the profiler then polls the timers periodically or acts on an
exit signal is a user space implementation detail.

Thanks,

tglx