Re: [PROBLEM] Machine Freezes while Running Crossover Office

From: Ingo Molnar
Date: Mon May 30 2005 - 14:18:23 EST



* Linus Torvalds <torvalds@xxxxxxxx> wrote:

> On Mon, 30 May 2005, Linus Torvalds wrote:
> >
> > So it's either a kernel scheduling problem, or Crossover running with RT
> > priority and screwing up.
>
> Btw, crossover being screwed up and runnign with RT priority would
> also explain why stracing it makes the problem go away - the tracing
> will cause the RT process to halt at system calls and yield to the
> tracer, which isn't RT.
>
> Of course, the same goes for a scheduler bug, so it's not like this
> proves anything one way or the other, but considering that others
> aren't reporting this problem with other programs..

Pekka, if none of the previous methods helps in better debugging this,
then one of the easiest ways to catch scheduler latency bugs (e.g.
related to sync wakeups, etc.) would be to try the -RT tree.

It has a built-in kernel/scheduler tracer that gets started upon wakeup
and is stopped when the task finally runs. If we lose a preemption
somewhere then this kernel should catch it. The -RT tree is development
code, but should work fine on most systems. Here's a QuickStart:

1)
download the latest patch from http://redhat.com/~mingo/realtime-preempt/

2)
patch your 2.6.12-rc5 tree with it

3)
copy your usual .config into this tree and do 'make oldconfig' - just
accept the default options it offers, except for the following two
cases: when it says "Preemption Mode", pick #1:

1. No Forced Preemption (Server) (PREEMPT_NONE) (NEW)

when it asks:

Interrupts-off critical section latency timing (CRITICAL_IRQSOFF_TIMING)

pick 'y'.

when it asks:

Latency tracing (LATENCY_TRACE) [N/y/?] (NEW) y

pick 'y' too.

4) compile & install your kernel as ususal and reboot into it.

now you'll be running a kernel with tracing built-in. The kernel comes
with the 'wakeup-timing' feature enabled by default, which can be
started via:

echo 0 > /proc/sys/kernel/preempt_max_latency

after this point you should get the maximum scheduling latencies
reported to the syslog:

[root@saturn ~]# echo 0 > /proc/sys/kernel/preempt_max_latency
[root@saturn ~]# dmesg | tail
( IRQ 17-776 |#0): new 1 us maximum-latency wakeup.
( ksoftirqd/0-2 |#0): new 1 us maximum-latency wakeup.
( kjournald-820 |#0): new 1 us maximum-latency wakeup.
( IRQ 14-781 |#0): new 3 us maximum-latency wakeup.
( ksoftirqd/0-2 |#0): new 3 us maximum-latency wakeup.
( IRQ 17-776 |#0): new 12 us maximum-latency wakeup.
( IRQ 17-776 |#0): new 36 us maximum-latency wakeup.
( IRQ 17-776 |#0): new 38 us maximum-latency wakeup.
( IRQ 17-776 |#0): new 40 us maximum-latency wakeup.
( IRQ 17-776 |#0): new 70 us maximum-latency wakeup.
[root@saturn ~]#

the kernel function trace of the largest latency will be under
/proc/latency_trace:

[root@saturn ~]# cat /proc/latency_trace
preemption latency trace v1.1.4 on 2.6.12-rc5-RT-V0.7.47-15
--------------------------------------------------------------------
latency: 3048 us, #471/471, CPU#0 | (M:server VP:0, KP:0, SP:1 HP:1 #P:1)
-----------------
| task: ksoftirqd/0-2 (uid:0 nice:-10 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
head-3704 0dn.. 0us : try_to_wake_up (wake_up_process)
head-3704 0dn.. 0us : try_to_wake_up <<...>-2> (6a 75):
head-3704 0dn.. 0us : wake_up_process (do_softirq)
head-3704 0dn.. 0us+< (608)
cat-3709 0dnh. 46us+: do_nmi (memcpy)
cat-3709 0dnh. 48us+: profile_hit (nmi_watchdog_tick)
cat-3709 0dnh. 148us+: do_nmi (memcpy)
cat-3709 0dnh. 150us+: profile_hit (nmi_watchdog_tick)
cat-3709 0dn.. 182us : smp_apic_timer_interrupt (c01b3b35 0 0)
cat-3709 0dnh. 182us : irq_exit (apic_timer_interrupt)
cat-3709 0dn.. 182us : do_softirq (apic_timer_interrupt)
cat-3709 0dn.. 182us : __do_softirq (do_softirq)
cat-3709 0dn.. 183us+< (608)
cat-3709 0dnh. 251us : do_nmi (memcpy)
[...]

(if you have the APIC code and the NMI watchdog enabled then you'll also
get a trace of userspace code looping in irqs-off sections.)

at this point try to reproduce the X hang. Do you get any large
(3000-5000 usecs) latency reported? (assuming you can reproduce the hang
under the -RT kernel)

(if you dont get any large latency reported by the tracer but the hangs
still happen then there's still a way to get this debugged, by running
the tracer in a free-running manually-triggered mode - i'll tell more
about this if it becomes necessary.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/