[RFD] posix-timers: CRIU woes

From: Thomas Gleixner
Date: Tue May 09 2023 - 17:42:49 EST


Hi!

This is a summary of several mails so that the CRIU people have the full
picture.

A recent syzbot report made me look at the timer ID management, which
was modified 7 years ago to accomodate CRIU:

5ed67f05f66c ("posix timers: Allocate timer id per process (v2)")

and introduced that reported issue along with a bogus loop termination
problem. See

https://lore.kernel.org/lkml/000000000000a3723305f9d98fc3@xxxxxxxxxx/
https://lore.kernel.org/lkml/20230425183312.932345089@xxxxxxxxxxxxx

for details.

The intent to make the timer IDs per process is definitely correct, but
the implementation is beyond suboptimal. I really regret that I did not
catch this back then when picking those changes up.

The way it works is that each process keeps a 'next_timer_id' which it
uses to allocate the next timer. That allows CRIU to reconstruct timers
with the same ID by walking the ID space via

do {
timer_create(&timer, ...., &id);
if (id == original_id)
goto success;
timer_delete(&timer);
} while (idspace_not_exhausted());

That works by some definition of works, but it is problematic in two ways:

1) As the timer ID space is up to INT_MAX, a process which creates and
deletes timers frequently, can easily move up close to the INT_MAX
id space over time.

If such a process is checkpointed and restored, then the above loop
will run for at least an hour to restore a single timer.

And no, this is not only a hypothetical issue. There are legitimate
scenarios where threads are created and the control thread arms a
posix CPU timer on them. Such threads can be torn down on a regular
base due to thread pool consolidations. As CRIU does not happen
every 5 minutes it's not completely unlikely that such a process
surives quite some time on a particular host and thereby approaches
the ID space limit.

Sure we can restrict the ID space to a way smaller number so the
search wraps around earlier, but what's a sensible limit?

Though restricting the ID space has its own issue vs. backwards
compability. A process which created a timer on an older kernel with
an ID larger than the newer kernels ID limit cannot longer be
restored on that newer kernel.

Aside of that it does not solve the other problem this created:

2) That change created an user space ABI, which means that the kernel
side has to stick with this next ID search mechanism forever.

That prevents to get rid of that global lock and hash table by
sticking an xarray into task::signal which makes a lot of sense in
terms of cache locality and gets rid of the extra timer list
management in task::signal. Making this change would be very useful
to address some other issues in the posix-timer code without
creating yet more duct tape horrors.

Such a change obviously has to aim for a dense ID space to keep the
memory overhead low, but that breaks existing CRIU userspace because
dense ID space and next ID search does not fit together.

Next ID search is obviously creating non-recoverable holes in the
case that timers are deleted afterwards.

A dense ID space approach can create holes too, but they are
recoverable and well within the resource limits, because the process
has to be able to create enough timers in the first place in order
to release those in the middle.

With the next ID search brute force recovery is not possible on a
kernel with dense ID as there is no way to create all intermediate
timers first before reaching the one at the far end due to resource
limits.

So because of that half thought out user space ABI we are now up the
regression creek without a paddle, unless CRIU can accomodate to a
different restore mechanism to lift this restriction from the kernel.

Thoughts?

Thanks,

tglx