Re: doublefault debugging (was Re: Linux v2.5.62 --- spontaneous reboots)

From: Linus Torvalds (torvalds@transmeta.com)
Date: Thu Feb 20 2003 - 13:01:15 EST


On Thu, 20 Feb 2003, Ingo Molnar wrote:
>
> a true heisenbug. I cannot reproduce it anymore. Anyway, from the serial
> console i collected 3 instances of crashes - whatever it's worth.

Pretty much every single time, release_task() has been there on the
backtrace.

In fact, I bet you this code in do_exit() is the cause:

        preempt_disable();

        if (tsk->exit_signal == -1)
***             release_task(tsk);              ***

        schedule();

Note how "release_task()" will be releasing the stack that the process is
running on right now. And the reason it doesn't crash _every_ time is
simply that you need to have:

 - another memory allocation that picks up that page and fills it with
   something else in order to get a corrupted stack
 - and something delays schedule() so that you have time to race _and_ you
   need the stack. Which is why most of the oopses have an interrupt come
   in inside schedule() (see the "common_interrupt()" thing).

In other words, I think we need to have schedule_tail() do the
release_task(), otherwise we'd release it too early while the task
structure (and the stack) are both still in use.
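
To make the ordering concrete, here is a rough userspace sketch of the
"release from the tail of the switch" idea. It is only a toy model: every
name in it (toy_task, toy_exit(), toy_schedule_tail(), run_demo() and so
on) is hypothetical and not the kernel's real API, and malloc()/free()
merely stand in for the stack page. The exiting task only marks itself
dead, and the memory is handed back once we are running as the next task.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct toy_task {
	int dead;        /* set by the exiting task itself */
	int released;    /* set once the memory has been handed back */
	char *stack;     /* stands in for the kernel stack page */
};

static struct toy_task *current_task;
static int releases_while_running;  /* counts frees of the running task */

static void toy_release(struct toy_task *t)
{
	if (t == current_task)
		releases_while_running++;   /* the bug pattern: freeing our own stack */
	free(t->stack);
	t->stack = NULL;
	t->released = 1;
}

/* Runs in the context of the next task, after the switch away from prev. */
static void toy_schedule_tail(struct toy_task *prev)
{
	if (prev->dead && !prev->released)
		toy_release(prev);
}

static void toy_switch_to(struct toy_task *next)
{
	struct toy_task *prev = current_task;

	current_task = next;        /* from here on prev's stack is unused */
	toy_schedule_tail(prev);
}

static void toy_exit(struct toy_task *next)
{
	current_task->dead = 1;     /* mark only; do not free our own stack */
	toy_switch_to(next);
}

/* Returns 0 if the dead task was released only after the switch. */
int run_demo(void)
{
	struct toy_task a = { 0, 0, malloc(4096) };
	struct toy_task b = { 0, 0, malloc(4096) };
	int ok;

	if (!a.stack || !b.stack) {
		free(a.stack);
		free(b.stack);
		return -1;
	}

	current_task = &a;
	toy_exit(&b);               /* task a exits, task b takes over */

	ok = a.released && !b.released && releases_while_running == 0;
	free(b.stack);
	return ok ? 0 : 1;
}
```

Calling toy_release() directly from toy_exit() instead would bump
releases_while_running, which is the toy equivalent of the do_exit()
ordering above.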

You owe me a patch.

                        Linus

---

> [<c01219fd>] release_task+0x17d/0x200
> [<c011e70f>] mmput+0x1f/0xc0
> [<c0122cad>] do_exit+0x31d/0x3b0

> [<c010a594>] common_interrupt+0x18/0x20
> [<c010a691>] error_code+0x2d/0x38
> [<c011b881>] schedule+0x3a1/0x3d0
> [<c01219fd>] release_task+0x17d/0x200
> [<c011e70f>] mmput+0x1f/0xc0
> [<c0122cad>] do_exit+0x31d/0x3b0

> [<c010bc28>] handle_IRQ_event+0x38/0x60
> [<c010bf6b>] do_IRQ+0x14b/0x1e0
> [<c010a594>] common_interrupt+0x18/0x20
> [<c010a691>] error_code+0x2d/0x38
> [<c011b881>] schedule+0x3a1/0x3d0
> [<c01219fd>] release_task+0x17d/0x200
> [<c011e70f>] mmput+0x1f/0xc0
> [<c0122cad>] do_exit+0x31d/0x3b0

> [<c010bf6b>] do_IRQ+0x14b/0x1e0
> [<c010a594>] common_interrupt+0x18/0x20
> [<c010a691>] error_code+0x2d/0x38
> [<c011b881>] schedule+0x3a1/0x3d0
> [<c01219fd>] release_task+0x17d/0x200
> [<c011e70f>] mmput+0x1f/0xc0
> [<c0122cad>] do_exit+0x31d/0x3b0

> [<c010bf6b>] do_IRQ+0x14b/0x1e0
> [<c010a594>] common_interrupt+0x18/0x20
> [<c010a691>] error_code+0x2d/0x38
> [<c011b881>] schedule+0x3a1/0x3d0
> [<c01219fd>] release_task+0x17d/0x200
> [<c011e70f>] mmput+0x1f/0xc0
> [<c0122cad>] do_exit+0x31d/0x3b0

> [<c011e06c>] __put_task_struct+0x7c/0x90
> [<c0122cad>] do_exit+0x31d/0x3b0

> [<c010bc28>] handle_IRQ_event+0x38/0x60
> [<c010bf6b>] do_IRQ+0x14b/0x1e0
> [<c010a594>] common_interrupt+0x18/0x20
> [<c010a691>] error_code+0x2d/0x38
> [<c011b881>] schedule+0x3a1/0x3d0
> [<c01219fd>] release_task+0x17d/0x200
> [<c011e70f>] mmput+0x1f/0xc0
> [<c0122cad>] do_exit+0x31d/0x3b0

> [<c010bc28>] handle_IRQ_event+0x38/0x60
> [<c010bf6b>] do_IRQ+0x14b/0x1e0
> [<c010a594>] common_interrupt+0x18/0x20
> [<c010a691>] error_code+0x2d/0x38
> [<c011b881>] schedule+0x3a1/0x3d0
> [<c01219fd>] release_task+0x17d/0x200
> [<c011e70f>] mmput+0x1f/0xc0
> [<c0122cad>] do_exit+0x31d/0x3b0
