[PATCH 2.6.12.5] NPTL signal delivery deadlock fix

From: Bhavesh P. Davda
Date: Wed Aug 17 2005 - 13:29:47 EST


This bug is subtle: it only triggers when a real-time threaded process
is in the middle of a coredump and something sends it a SIGKILL. When it
does trigger, the resulting deadlock leaves the system hosed, and you
have to reboot to recover. The fix is a one-liner: handle_stop_signal()
tests SIGNAL_GROUP_EXIT against p->flags when it should test
p->signal->flags.

That is not good for real-time priority-preemption applications like our
telephony application, which has 90+ real-time (SCHED_FIFO and SCHED_RR)
processes, many of them multi-threaded, interacting with each other for
high-volume call processing.

- Bhavesh

Also included, for your reading pleasure, is a complete analysis of how
the system gets into a deadlock due to this bug. I wanted to post it
because I spent several hours analysing this.

--
Bhavesh P. Davda | Distinguished Member of Technical Staff | Avaya |
1300 West 120th Avenue | B3-B03 | Westminster, CO 80234 | U.S.A. |
Voice/Fax: 303.538.4438 | bhavesh@xxxxxxxxx

diff -Naur linux-2.6.12.5/kernel/signal.c linux-2.6.12.5-sigfix/kernel/signal.c
--- linux-2.6.12.5/kernel/signal.c 2005-08-14 18:20:18.000000000 -0600
+++ linux-2.6.12.5-sigfix/kernel/signal.c 2005-08-17 11:36:20.547600092 -0600
@@ -686,7 +686,7 @@
{
struct task_struct *t;

- if (p->flags & SIGNAL_GROUP_EXIT)
+ if (p->signal->flags & SIGNAL_GROUP_EXIT)
/*
* The process is in the middle of dying already.
*/

When bash sends SIGABRT to rt-pthreaded-app main thread:

bash: sys_kill(pid, SIGABRT)
kill_something_info(SIGABRT, &info, pid)
kill_proc_info(SIGABRT, info, pid)
p = find_task_by_pid(pid), group_send_sig_info(SIGABRT, info, p)
__group_send_sig_info(SIGABRT, info, p)
__group_complete_signal(SIGABRT, p)
Still bash, p==rt-pthreaded-app main thread:

static void __group_complete_signal(int sig, struct task_struct *p)
{
unsigned int mask;
struct task_struct *t;

/*
* Don't bother traced and stopped tasks (but
* SIGKILL will punch through that).
*/
mask = TASK_STOPPED | TASK_TRACED;
if (sig == SIGKILL)
mask = 0;

==> mask == TASK_STOPPED|TASK_TRACED
/*
* Now find a thread we can wake up to take the signal off the queue.
*
* If the main thread wants the signal, it gets first crack.
* Probably the least surprising to the average bear.
*/
if (wants_signal(sig, p, mask))
t = p;
==> t = p (rt-pthreaded-app main thread)
else if (thread_group_empty(p))
/*
* There is just one thread and it does not need to be woken.
* It will dequeue unblocked signals before it runs again.
*/
return;
else {
/*
* Otherwise try to find a suitable thread.
*/
t = p->signal->curr_target;
if (t == NULL)
/* restart balancing at this thread */
t = p->signal->curr_target = p;
BUG_ON(t->tgid != p->tgid);

while (!wants_signal(sig, t, mask)) {
t = next_thread(t);
if (t == p->signal->curr_target)
/*
* No thread needs to be woken.
* Any eligible threads will see
* the signal in the queue soon.
*/
return;
}
p->signal->curr_target = t;
}

/*
* Found a killable thread. If the signal will be fatal,
* then start taking the whole group down immediately.
*/
if (sig_fatal(p, sig) && !(p->signal->flags & SIGNAL_GROUP_EXIT) &&
!sigismember(&t->real_blocked, sig) &&
(sig == SIGKILL || !(t->ptrace & PT_PTRACED))) {
==> sig_fatal(p, SIGABRT) true
==> SIGNAL_GROUP_EXIT is not set yet
==> SIGABRT is not blocked
==> p is not PT_PTRACED
/*
* This signal will be fatal to the whole group.
*/
if (!sig_kernel_coredump(sig)) {
==> SIGABRT is sig_kernel_coredump(), skip
/*
* Start a group exit and wake everybody up.
* This way we don't have other threads
* running and doing things after a slower
* thread has the fatal signal pending.
*/
p->signal->flags = SIGNAL_GROUP_EXIT;
p->signal->group_exit_code = sig;
p->signal->group_stop_count = 0;
t = p;
do {
sigaddset(&t->pending.signal, SIGKILL);
signal_wake_up(t, 1);
t = next_thread(t);
} while (t != p);
return;
}

/*
* There will be a core dump. We make all threads other
* than the chosen one go into a group stop so that nothing
* happens until it gets scheduled, takes the signal off
* the shared queue, and does the core dump. This is a
* little more complicated than strictly necessary, but it
* keeps the signal state that winds up in the core dump
* unchanged from the death state, e.g. which thread had
* the core-dump signal unblocked.
*/
rm_from_queue(SIG_KERNEL_STOP_MASK, &t->pending);
rm_from_queue(SIG_KERNEL_STOP_MASK, &p->signal->shared_pending);
p->signal->group_stop_count = 0;
p->signal->group_exit_task = t;
t = p;
==> Start with thread being killed
do {
p->signal->group_stop_count++;
==> For rt-pthreaded-app this will be done twice (for the 2 subthreads)
signal_wake_up(t, 0);
==> This is a no-op so far, because the subthread "t" doesn't have a signal
t = next_thread(t);
} while (t != p);
wake_up_process(p->signal->group_exit_task);
==> This wakes up the main rt-pthreaded-app thread. At this point in time,
==> group_stop_count == 2, but SIGNAL_GROUP_EXIT is still not set
return;
==> BASH IS DONE.
}

/*
* The signal is already in the shared-pending queue.
* Tell the chosen thread to wake up and dequeue it.
*/
signal_wake_up(t, sig == SIGKILL);
return;
}


rt-pthreaded-app main thread:
======================
Coming out of schedule(), it will look for pending signals

do_notify_resume()
do_signal()
signr = get_signal_to_deliver(&info, &ka, regs, NULL);
get_signal_to_deliver()
if (unlikely(current->signal->group_stop_count > 0) &&
handle_group_stop())
==> group_stop_count is 2, so call handle_group_stop()
handle_group_stop()
if (current->signal->group_exit_task == current) {
==> This is true
/* Group stop is so we can do a core dump,
* We are the initiating thread, so get on with it. */
current->signal->group_exit_task = NULL;
return 0;
}
==> back to get_signal_to_deliver()
signr = dequeue_signal(current, mask, info);
==> signr == SIGABRT
if (!signr) break; /* will return 0 */ (not true, signr==SIGABRT)
if ((current->ptrace & PT_PTRACED) && signr != SIGKILL) {
(not true, skip)
ka = &current->sighand->action[signr-1];
if (ka->sa.sa_handler == SIG_IGN) /* Do nothing. */
continue; (not true, handler == SIG_DFL)
if (ka->sa.sa_handler != SIG_DFL) {
(not true, skip)
if (sig_kernel_ignore(signr)) /* Default is nothing. */ continue;
(not true, skip)
if (current->pid == 1) continue; (not true, skip)
if (sig_kernel_stop(signr)) { (not true, skip)
/* Anything else is fatal, maybe with a core dump. */
current->flags |= PF_SIGNALED;
if (sig_kernel_coredump(signr)) {
==> TRUE
do_coredump((long)signr, signr, regs);

do_coredump(SIGABRT, SIGABRT, regs)
current->signal->flags = SIGNAL_GROUP_EXIT;
==> Finally we set SIGNAL_GROUP_EXIT here
current->signal->group_exit_code = exit_code;
==> group_exit_code == SIGABRT
coredump_wait(mm);

coredump_wait(mm)
mm->core_waiters++; /* let other threads block */
/* give other threads a chance to run: */
yield();
zap_threads(mm);


zap_threads(mm)
do_each_thread(g,p)
if (mm == p->mm && p != tsk) {
force_sig_specific(SIGKILL, p);
==> This is where the rt-pthreaded-app subthreads are sent a SIGKILL

force_sig_specific(SIGKILL, p)
specific_send_sig_info(SIGKILL, (void *)2, t);
specific_send_sig_info(SIGKILL, 2, t)
ret = send_signal(SIGKILL, 2, t, &t->pending);
send_signal(SIGKILL, 2, t, &t->pending)
/*
* fast-pathed signals for kernel-internal things like SIGSTOP
* or SIGKILL.
*/
if ((unsigned long)info == 2) goto out_set;
(True)
sigaddset(&signals->signal, sig);
return ret; // returns 0
Back to specific_send_sig_info(SIGKILL, 2, t)
if (!ret && !sigismember(&t->blocked, sig))
signal_wake_up(t, sig == SIGKILL);
(True)
signal_wake_up(t, TRUE)
set_tsk_thread_flag(t, TIF_SIGPENDING);
mask = TASK_INTERRUPTIBLE;
if (resume) (True)
mask |= TASK_STOPPED | TASK_TRACED;
if (!wake_up_state(t, mask))
kick_process(t)
==> This will wake up rt-pthreaded-app subthreads whether they are in
==> TASK_INTERRUPTIBLE, TASK_STOPPED, or TASK_TRACED states
==> THIS WON'T WAKE UP TASK_UNINTERRUPTIBLE THREADS

==> At this point in time:
==> group_stop_count == 2, SIGNAL_GROUP_EXIT is set in all threads
mm->core_waiters++;
==> This finally becomes 3 (main + 2 subthreads)
}
while_each_thread(g,p);

Back to coredump_wait()
if (--mm->core_waiters) {
==> Main thread decrements core_waiters back to 2.
up_write(&mm->mmap_sem);
wait_for_completion(&startup_done);
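
To make the wake-up rule in signal_wake_up() concrete, here is a minimal
userspace sketch in C. The helper name would_wake() and the numeric state
values are assumptions made for illustration; only the mask logic is taken
from the trace above.

```c
/* Illustrative task-state bits. The numeric values are assumptions for
 * this sketch and are not quoted from the kernel headers. */
enum {
    TASK_INTERRUPTIBLE   = 1,
    TASK_UNINTERRUPTIBLE = 2,
    TASK_STOPPED         = 4,
    TASK_TRACED          = 8
};

/* Hypothetical helper: returns 1 if a task currently in `state` would be
 * woken by signal_wake_up(t, resume), following the mask logic in the
 * trace: TASK_INTERRUPTIBLE always, plus STOPPED/TRACED when resume != 0. */
static int would_wake(int state, int resume)
{
    int mask = TASK_INTERRUPTIBLE;

    if (resume)
        mask |= TASK_STOPPED | TASK_TRACED;

    return (state & mask) != 0;
}
```

With resume set, would_wake() is true for the interruptible, stopped and
traced states, and false for TASK_UNINTERRUPTIBLE in either mode. That
matches the note above that uninterruptible threads are not woken.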


NOW, IF THE MAIN rt-pthreaded-app THREAD IS SENT A SIGKILL WHILE WAITING

handle_stop_signal()
if (p->flags & SIGNAL_GROUP_EXIT) return;
***** WRONG CHECK! SHOULD BE (p->signal->flags & SIGNAL_GROUP_EXIT) *****
else if (sig == SIGKILL) {
p->signal->flags = 0;
}
********* WHOOPS! Just cleared SIGNAL_GROUP_EXIT **************
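
The mix-up is between two different flag words: the per-task PF_* flags in
p->flags and the per-group SIGNAL_* flags in p->signal->flags. The
following userspace sketch in C models only that distinction. The struct
names, the function names and the bit values are hypothetical and chosen
for illustration; they are not the kernel's definitions.

```c
/* Illustrative bit values (assumed for this sketch). */
#define PF_SIGNALED        0x00000400u  /* per-task word: task->flags */
#define SIGNAL_GROUP_EXIT  0x00000008u  /* shared word: task->signal->flags */

struct signal_model { unsigned int flags; };
struct task_model   { unsigned int flags; struct signal_model *signal; };

/* Pre-patch check: tests a SIGNAL_* bit against the per-task PF_* word. */
static int group_exit_seen_buggy(const struct task_model *p)
{
    return (p->flags & SIGNAL_GROUP_EXIT) != 0;
}

/* Post-patch check: tests the bit against the shared signal flags. */
static int group_exit_seen_fixed(const struct task_model *p)
{
    return (p->signal->flags & SIGNAL_GROUP_EXIT) != 0;
}

/* Hypothetical model of the SIGKILL path through handle_stop_signal(). */
static void model_handle_stop_signal_kill(struct task_model *p,
                                          int use_fixed_check)
{
    int exiting = use_fixed_check ? group_exit_seen_fixed(p)
                                  : group_exit_seen_buggy(p);

    if (exiting)
        return;              /* already dying: leave the state alone */

    p->signal->flags = 0;    /* SIGKILL branch wipes the group flags */
}
```

Starting from a task with PF_SIGNALED in its own flags and
SIGNAL_GROUP_EXIT in the shared flags, the buggy check misses the group
exit and the shared flags end up zero. With the fixed check the function
returns early and SIGNAL_GROUP_EXIT survives.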

rt-pthreaded-app subthread:
====================
Coming out of schedule(), it will look for pending signals

do_notify_resume()
do_signal()
signr = get_signal_to_deliver(&info, &ka, regs, NULL);
get_signal_to_deliver()
if (unlikely(current->signal->group_stop_count > 0) &&
handle_group_stop())
==> group_stop_count is 2, so call handle_group_stop()
handle_group_stop()
if (current->signal->group_exit_task == current) {
(False)
if (current->signal->flags & SIGNAL_GROUP_EXIT) return;
(SHOULD HAVE BEEN TRUE, BUT WAS CLEARED BY MAIN THREAD)
stop_count = --current->signal->group_stop_count;
==> group_stop_count is now 1
if (stop_count == 0)
current->signal->flags = SIGNAL_STOP_STOPPED;
current->exit_code = current->signal->group_exit_code;
==> exit_code == SIGABRT
set_current_state(TASK_STOPPED);
==> Task enters TASK_STOPPED state
finish_stop(stop_count);

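==> The subthread is now parked in TASK_STOPPED and never goes on to exit,
==> while the main thread is still in wait_for_completion(&startup_done)

DEADLOCK!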
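
The end state can be reproduced with a small userspace model in C. This is
a simplification under stated assumptions: the struct and function names
are hypothetical, the bit values are illustrative, and the model assumes
the dumping thread's wait completes only once every other thread has
stopped being a core waiter by exiting.

```c
#include <stdbool.h>

/* Illustrative bit values (assumed for this sketch). */
#define SIGNAL_STOP_STOPPED 0x1u
#define SIGNAL_GROUP_EXIT   0x8u

struct group_model {
    unsigned int flags;     /* stands in for signal->flags */
    int group_stop_count;   /* stands in for signal->group_stop_count */
    int core_waiters;       /* stands in for mm->core_waiters */
};

/* Models a non-initiating thread in handle_group_stop().
 * Returns true if the thread parks itself in TASK_STOPPED. */
static bool subthread_stops(struct group_model *g)
{
    if (g->flags & SIGNAL_GROUP_EXIT)
        return false;                   /* group exiting: go on to die */

    if (--g->group_stop_count == 0)
        g->flags = SIGNAL_STOP_STOPPED;

    return true;
}

/* Runs nthreads subthreads through the model. Returns true if the
 * dumping thread would wait forever (some waiter never exited). */
static bool dumper_deadlocks(struct group_model *g, int nthreads)
{
    for (int i = 0; i < nthreads; i++) {
        if (!subthread_stops(g))
            g->core_waiters--;          /* thread exits, stops waiting */
    }
    return g->core_waiters > 0;
}
```

With flags cleared to 0, group_stop_count == 2 and core_waiters == 2 (the
values from the trace), both subthreads stop and dumper_deadlocks() returns
true. With SIGNAL_GROUP_EXIT still set, both subthreads fall through to
exit and it returns false.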