Re: CPU POSIX timers livelock

From: Björn Steinbrink
Date: Fri May 02 2008 - 11:21:41 EST


[Added Roland McGrath and Frank Mayhar to Cc:, as this sounds similar
enough to what has been discussed here: http://lkml.org/lkml/2008/2/6/505]

On 2008.05.02 17:05:41 +0200, Petr Tesarik wrote:
> Hello,
>
> there seems to be a possible livelock in the CPU timers (actually, we've
> experienced it once already). The analysis led to the conclusion that
> fixing this properly will be a bit harder.
>
> Problem Statement
>
> 1. Any process can set ITIMER_PROF to an interval as short as one
> timer tick
> 2. If the process is CPU-bound (e.g. a number-crunching application),
> the timer will expire with every timer tick
> 3. If the process has N threads and each thread runs on its own
> CPU, the timer will expire N times per timer tick
>
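
To make points 1-3 above concrete, here is a minimal userspace sketch of
the kind of workload being described (illustrative only: no error
handling, the burner count is hardcoded rather than derived from the
actual number of CPUs, and the 1 usec interval simply gets rounded up to
the timer resolution, i.e. one tick):

#include <sys/time.h>
#include <signal.h>
#include <pthread.h>

#define NBURNERS 127	/* e.g. one burner per remaining CPU on a 128p box */

static void on_sigprof(int sig)
{
	/* Empty handler; only the delivery rate matters here. */
}

static void *burn(void *arg)
{
	volatile unsigned long x = 0;

	for (;;)		/* CPU-bound, never blocks */
		x++;
	return NULL;
}

int main(void)
{
	struct itimerval it = {
		.it_interval = { .tv_sec = 0, .tv_usec = 1 },
		.it_value    = { .tv_sec = 0, .tv_usec = 1 },
	};
	pthread_t tid;
	int i;

	signal(SIGPROF, on_sigprof);
	setitimer(ITIMER_PROF, &it, NULL);

	for (i = 0; i < NBURNERS; i++)
		pthread_create(&tid, NULL, burn, NULL);
	burn(NULL);
	return 0;
}
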
> Now, this can become quite interesting on a larger system. For instance,
> with 128 CPUs and a (conservative) setting of HZ 300, any process can
> effectively ask to be sent a SIGPROF every 26 usec (1/300/128 s). Pretty
> fast, but it can get worse if you consider the pitfalls of a NUMA
> architecture.
>
> Whenever the timer expires, a new expiration time must be computed for
> each thread in the thread group. The values are stored in the
> task_struct of each thread, and IIRC those are kept local to the CPU
> where the thread is running (quite logically). Put another way, walking
> all threads means touching remote memory, except for the few
> task_structs that are local to the recomputing CPU. In the 128-CPU
> example, if each node has 2 CPUs (think Altix), only 2 accesses go to
> the local node and 126 go to remote memory. These are already quite
> expensive, but it gets even worse. The timer interrupt is generated
> locally on each CPU, so if the process is spread over all of them (see
> above), all CPUs recompute the timer (hint: write access), constantly
> invalidating each other's cached copies of that remote memory.
>
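
For readers without the source at hand, the thread walk in question
looks roughly like this; it is condensed from memory of the ITIMER_PROF
rebalancing in kernel/posix-cpu-timers.c (check_process_timers() /
process_timer_rebalance()), so treat it as a sketch rather than the
literal code:

/* Sketch only, not the literal kernel code. Called with
 * sighand->siglock held each time the process-wide timer fires;
 * "left" is the remaining interval divided by the thread count. */
static void rebalance_prof_sketch(struct task_struct *p, cputime_t left)
{
	struct task_struct *t = p;

	do {
		if (likely(!(t->flags & PF_EXITING))) {
			/* Reading utime/stime and writing it_prof_expires
			 * touch a task_struct that most likely lives on
			 * another node; the write also invalidates the
			 * owning CPU's cached copy. */
			cputime_t ticks = cputime_add(cputime_add(t->utime,
								  t->stime),
						      left);

			if (cputime_eq(t->it_prof_expires, cputime_zero) ||
			    cputime_gt(t->it_prof_expires, ticks))
				t->it_prof_expires = ticks;
		}
		t = next_thread(t);
	} while (t != p);
}
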
> And those computations cannot run in parallel, of course, because
> everything is serialized on sighand->siglock (see run_posix_cpu_timers).
> So, each time one CPU finishes walking the threads, there's already a
> queue of all the other CPUs waiting to do the same again. For this kind
> of load, those 26 usec (approx. 41600 CPU cycles on a 1.6 GHz CPU) may
> well be insufficient. Even more so if somebody else occasionally takes
> the write lock on tasklist_lock. It may so happen that the system spends
> all its CPU cycles re-computing the profiling timers over and over again
> and never makes any progress. Note this is partly caused by the fact
> that the time needed to handle the profiling timer is itself also
> counted towards the profiling time.
>
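
The serialization point is easy to see from the shape of the tick-time
path. The following is heavily condensed from memory of
kernel/posix-cpu-timers.c (the fast-path expiry test is folded into an
illustrative helper and the actual firing of expired timers is omitted),
so again a sketch, not the real function:

/* Sketch of the tick-time path; not the literal kernel code. */
void run_posix_cpu_timers_sketch(struct task_struct *tsk)
{
	LIST_HEAD(firing);

	/* Fast path: per-thread expiry fields say nothing has expired.
	 * (Illustrative helper, not a real kernel function.) */
	if (!cpu_timers_possibly_expired(tsk))
		return;

	/* Every CPU that takes a timer interrupt while running a thread
	 * of this process funnels through this one lock. With a 26 usec
	 * effective interval, the walk below barely finishes before the
	 * next CPU is already spinning here. */
	spin_lock(&tsk->sighand->siglock);
	check_thread_timers(tsk, &firing);
	check_process_timers(tsk, &firing);	/* the O(nthreads) walk */
	spin_unlock(&tsk->sighand->siglock);

	/* ...then fire the timers collected on the list... */
}
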
> This simply does not scale for large systems (and I was not even
> considering those really large 1024p ones).
>
> Ideas
>
> Any ideas for improvement? Obviously, storing per-process expiration
> times in one place instead of dividing the interval by the number of
> currently running threads would help here. But it would spoil the fast
> path (when there are no timers at all). Preventing multiple threads
> from re-computing the same values might also help (although I've got
> no idea how to implement this).
>
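
Purely to illustrate the trade-off mentioned above (none of the fields
or helpers below exist in the kernel; this is a hypothetical sketch of
the "store it in one place" idea): a single process-wide expiry plus a
running group total would turn the tick-time check into one comparison,
but keeping that total current moves work onto every tick of every
thread, which is exactly what would hurt the no-timers fast path.

/* Hypothetical sketch only; these fields do not exist in the kernel. */
struct group_prof_state {
	cputime_t total;	/* sum of utime+stime over all threads  */
	cputime_t expires;	/* single process-wide expiration value */
};

/* Tick-time check: one compare instead of an O(nthreads) walk under
 * siglock. */
static int group_prof_expired(const struct group_prof_state *gp)
{
	return !cputime_eq(gp->expires, cputime_zero) &&
	       !cputime_lt(gp->total, gp->expires);
}

/* ...but the price is that every thread's tick now has to update the
 * shared total, i.e. a write to shared state even when no timer is
 * armed, unless that update is itself made conditional somehow. */
static void group_prof_account_tick(struct group_prof_state *gp,
				    cputime_t delta)
{
	gp->total = cputime_add(gp->total, delta);
}
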
> Any more comments (before I start hacking something you'll NAK)?
>
> BTW, did POSIX really mean to allow an arbitrarily small interval for
> SIGPROF?
>
> Petr Tesarik