Posix process cpu timer inaccuracies

From: Delyan Kratunov
Date: Sat Feb 10 2024 - 20:47:55 EST

Hi folks,

I've heard about issues with process cpu timers for a while (~years) but only
recently found the time to look into them. I'm starting this thread in an
attempt to get directional opinions on how to resolve them (I'm happy to do
the work itself).

Let's take setitimer(2). The man page says that "Under very heavy loading, an
ITIMER_REAL timer may expire before the signal from a previous expiration has
been delivered." This is true but incomplete - the same issue plagues
ITIMER_PROF and ITIMER_VIRTUAL as well. I'll call this property "completeness",
i.e. all accrued process CPU time should be accounted for by the signals
delivered to the process.

A second issue is proportionality. Specifically for setitimer, there appears to
be an expectation in userspace that the number of signals received per thread
is proportional to that thread's CPU time. I'm not sure where this belief is
coming from but my guess is that people assumed multi-threadedness preserved
the "sample a stack trace on every SIGPROF" methodology from single-threaded
setitimer usage. I don't know if it was ever possible but you cannot currently
implement this strategy and get good data out of it. Yet, there's software
like gperftools that assumes you can. (Did this ever work well?)

1. Completeness

The crux of the completeness issue is that process CPU time can easily be
accrued faster than signals on a shared queue can be dequeued. Relatively
large time intervals like 10ms can trivially drop signals on a 12-core
24-thread system, but in my tests 2-core 4-thread systems behave just as
poorly under enough load.
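
One way to observe the gap from userspace is to count delivered SIGPROFs and
compare that with the number of expirations the accrued process CPU time
implies. The following is only a minimal sketch: the helper names
(expected_ticks, process_cpu_ns, start_prof_timer, missed_ticks) are made up
for illustration, and it assumes atomic_ulong is lock-free on the target so
that it is safe to update from a signal handler.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* expose setitimer() and the CPU-time clock IDs */
#endif
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

/* SIGPROFs actually delivered, summed over all threads. */
static atomic_ulong sigprof_seen;

static void on_sigprof(int signo)
{
    (void)signo;
    atomic_fetch_add(&sigprof_seen, 1UL);
}

/* Expirations that "complete" delivery would produce for cpu_ns of CPU time. */
uint64_t expected_ticks(uint64_t cpu_ns, uint64_t interval_us)
{
    assert(interval_us > 0);
    return cpu_ns / (interval_us * 1000ULL);
}

/* Process-wide CPU time (user + system, all threads) in nanoseconds. */
uint64_t process_cpu_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Install the handler and arm ITIMER_PROF; returns 0 on success. */
int start_prof_timer(uint64_t interval_us)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigprof;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;

    it.it_interval.tv_sec = (time_t)(interval_us / 1000000ULL);
    it.it_interval.tv_usec = (suseconds_t)(interval_us % 1000000ULL);
    it.it_value = it.it_interval;
    return setitimer(ITIMER_PROF, &it, NULL);
}

/* Expirations not accounted for by delivered signals since start_cpu_ns. */
int64_t missed_ticks(uint64_t start_cpu_ns, uint64_t interval_us)
{
    uint64_t want = expected_ticks(process_cpu_ns() - start_cpu_ns, interval_us);
    uint64_t got = atomic_load(&sigprof_seen);
    return (int64_t)want - (int64_t)got;
}
```

A caller would record process_cpu_ns() just before start_prof_timer(), spin
up some CPU-bound threads, and then check missed_ticks() after they finish. A
value that keeps growing with the thread count would be consistent with the
dropped signals described above.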

There are a few possible improvements to alleviate or fix this.

a. Instead of delivering the signal to the shared queue, we can deliver it to
the task that won the "process cpu timers" race. This improves the situation
by effectively sharding the collision space by the number of runnable threads.

b. An alternative solution would be to search through the threads for one that
doesn't have the signal queued and deliver to it. This leads to more overhead
but better signal delivery guarantees. However, it also has worse behavior
w.r.t. waking up idle threads.

c. A third solution may be to treat SIGPROF and SIGVTALRM as rt-signals when
delivered due to an itimer expiring. I'm not convinced this is necessary but
it's the most complete solution.

2. Proportionality

The issue of proportionality is really the issue of "can you use signals for
multi-threaded profiling at all?" As it stands, no mechanism ensures
proportionality, so the distribution of signals across threads is meaningless.

The only way I can think of to actually enforce this property is to keep
snapshots of per-thread cpu time and diff them from one SIGPROF to the next to
determine the target thread (by doing a weighted random choice). It's not _a
lot_ of work but it's certainly a little more overhead and a fair bit of
complexity. With POSIX_CPU_TIMERS_TASK_WORK=y, this extra overhead shouldn't
impact things too much.
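
To make the selection step concrete, here is a rough userspace sketch of the
weighted choice. It is not kernel code and not a proposed patch: pick_target,
its arrays of per-thread CPU-time snapshots, and the caller-supplied random
value r are all hypothetical names used only to illustrate the idea.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Choose the thread that should receive the next SIGPROF.
 *
 * prev_ns[i] / cur_ns[i]: CPU time of thread i at the previous and current
 * expiration. r: a random number supplied by the caller.
 *
 * Each thread is picked with probability proportional to the CPU time it
 * accrued since the last snapshot. Returns the thread index, or -1 if no
 * thread accrued any time. (r % total has a slight modulo bias, which is
 * ignored in this sketch.)
 */
int pick_target(const uint64_t *prev_ns, const uint64_t *cur_ns,
                size_t n, uint64_t r)
{
    uint64_t total = 0;
    uint64_t point;
    size_t i;

    for (i = 0; i < n; i++)
        if (cur_ns[i] > prev_ns[i])
            total += cur_ns[i] - prev_ns[i];
    if (total == 0)
        return -1;

    /* Walk the cumulative distribution until we pass the random point. */
    point = r % total;
    for (i = 0; i < n; i++) {
        uint64_t delta = cur_ns[i] > prev_ns[i] ? cur_ns[i] - prev_ns[i] : 0;
        if (point < delta)
            return (int)i;
        point -= delta;
    }
    return -1; /* not reached when total > 0 */
}
```

With deltas of 10, 0 and 30 (in any unit), thread 0 is chosen for a quarter
of the possible values of r and thread 2 for the rest. The idle thread 1 is
never chosen, which is the property the current delivery lacks.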

Note that proportionality is orthogonal to completeness - while you can
configure posix timers to use rt-signals with timer_create (which fixes
completeness), they still have the same distribution issues.
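
For reference, the timer_create setup mentioned here might look roughly like
the sketch below. The function names (start_cpu_timer, expirations_seen) and
the choice of SIGRTMIN are assumptions made for the example. The handler also
adds the Linux-specific si_overrun field so that expirations coalesced while
the signal was still pending are counted. Older glibc versions need -lrt to
link timer_create.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* expose timer_create() and si_overrun */
#endif
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

/* Expirations accounted for so far, including reported overruns. */
static atomic_ulong expirations;

static void on_tick(int signo, siginfo_t *si, void *uctx)
{
    (void)signo;
    (void)uctx;
    /* One delivered signal plus any expirations that were coalesced into it. */
    atomic_fetch_add(&expirations, 1UL + (unsigned long)si->si_overrun);
}

unsigned long expirations_seen(void)
{
    return atomic_load(&expirations);
}

/* Arm a process-CPU-time timer that signals with SIGRTMIN; 0 on success. */
int start_cpu_timer(timer_t *out, long interval_ns)
{
    struct sigaction sa;
    struct sigevent sev;
    struct itimerspec its;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_tick;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGRTMIN, &sa, NULL) != 0)
        return -1;

    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGRTMIN;
    if (timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, out) != 0)
        return -1;

    its.it_interval.tv_sec = interval_ns / 1000000000L;
    its.it_interval.tv_nsec = interval_ns % 1000000000L;
    its.it_value = its.it_interval;
    return timer_settime(*out, 0, &its, NULL);
}
```

Note that the signal is still directed at the process as a whole, so which
thread runs on_tick says nothing about which thread consumed the CPU time.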

Overall, I'd love to hear opinions on 1) whether either or both of these
concerns are worth fixing (I can expand on why I think they are) and 2) the
direction the work should take.

Thanks for reading all this,
-- Delyan