Re: 2.6.19-rt14 slowdown compared to 2.6.19

From: Ingo Molnar
Date: Sat Dec 30 2006 - 05:07:36 EST



* Chen, Tim C <tim.c.chen@xxxxxxxxx> wrote:

> Ingo Molnar wrote:
> >
> > If you'd like to profile this yourself then the lowest-cost way of
> > profiling lock contention on -rt is to use the yum kernel and run the
> > attached trace-it-lock-prof.c code on the box while your workload is
> > in 'steady state' (and is showing those extended idle times):
> >
> > ./trace-it-lock-prof > trace.txt
> >
> > this captures up to 1 second worth of system activity, on the current
> > CPU. Then you can construct the histogram via:
> >
> > grep -A 1 ' __schedule()<-' trace.txt | cut -d: -f2- | sort |
> > uniq -c | sort -n > prof.txt
> >
>
> I did lock profiling on VolanoMark as suggested and obtained the
> profile listed below.

thanks - this is really useful!

> 246 __sched_text_start()<-schedule()<-rt_spin_lock_slowlock()<-__lock_text_start()
> 264 rt_mutex_slowunlock()<-rt_mutex_unlock()<-rt_up_read()<-(-1)()
> 334 __sched_text_start()<-schedule()<-posix_cpu_timers_thread()<-kthread()
> 437 __sched_text_start()<-schedule()<-do_futex()<-sys_futex()
> 467 (-1)()<-(0)()<-(0)()<-(0)()
> 495 __sched_text_start()<-preempt_schedule()<-__spin_unlock_irqrestore()<-rt_mutex_adjust_prio()
> 497 __netif_rx_schedule()<-netif_rx()<-loopback_xmit()<-(-1)()
> 499 __sched_text_start()<-schedule()<-schedule_timeout()<-sk_wait_data()
> 500 tcp_recvmsg()<-sock_common_recvmsg()<-sock_recvmsg()<-(-1)()
> 503 __rt_down_read()<-rt_down_read()<-do_futex()<-(-1)()
> 1160 __sched_text_start()<-schedule()<-ksoftirqd()<-kthread()
> 1433 __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
> 1497 child_rip()<-(-1)()<-(0)()<-(0)()
> 1936 __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
>
> Looks like the idle time I saw was due to lock contention during calls
> to futex_wake, which requires acquiring current->mm->mmap_sem. Many of
> the java threads share an mm, resulting in concurrent access to the
> common mm. [...]

ah. This explains why I'm not seeing this bad contention in a
comparable workload (hackbench.c): hackbench uses processes, not
threads.
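
(for reference, the wake side looks roughly like this in 2.6.19-era
kernel/futex.c - a simplified sketch with the hashing and error paths
elided, not the literal source:)

	static int futex_wake(u32 __user *uaddr, int nr_wake)
	{
		union futex_key key;
		int ret;

		/*
		 * every waker serializes here: under -rt this rwsem is
		 * backed by an rt_mutex, so threads sharing one mm
		 * contend even though they only _read_ the address space.
		 */
		down_read(&current->mm->mmap_sem);

		ret = get_futex_key(uaddr, &key);
		if (unlikely(ret != 0))
			goto out;

		/* ... hash the key, walk the bucket, wake up to
		 * nr_wake waiters ... */

	out:
		up_read(&current->mm->mmap_sem);
		return ret;
	}

a process-based workload like hackbench goes through the same path, but
every process has its own mm, so the down_read() there never contends.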

> [...] Looks like in the -rt case there is no special treatment for
> read locking, so read-lock acquisitions contend under __rt_down_read.
> In the non-rt case, __down_read distinguishes read-lock access, so
> readers do not contend with each other.

yeah, makes sense. I'll do something about this.
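
(the difference in a nutshell - simplified sketches of both sides, not
the literal source of either tree:)

	/*
	 * mainline: readers share the lock via an atomic count, so
	 * concurrent down_read() calls do not block each other; only
	 * a writer forces the slowpath.
	 */
	static inline void __down_read(struct rw_semaphore *sem)
	{
		if (unlikely(atomic_inc_return((atomic_t *)&sem->count) <= 0))
			rwsem_down_read_failed(sem);	/* writer present */
	}

	/*
	 * -rt: the rwsem is backed by a single rt_mutex, so even pure
	 * readers serialize against each other (modulo owner-recursion
	 * details).
	 */
	void __rt_down_read(struct rw_semaphore *rwsem)
	{
		rt_mutex_lock(&rwsem->lock);	/* one owner at a time */
	}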

> Things are made worse here, as this delays waking up the processes
> blocked on the futex. See also the snippet of the latency_trace below.
>
> <idle>-0 2D..2 5821us!: thread_return <softirq--31> (150 20)
> <idle>-0 2DN.1 6278us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()
> <idle>-0 2DN.1 6278us : (0)()<-(0)()<-(0)()<-(0)()
> java-6648 2D..2 6280us+: thread_return <<idle>-0> (20 -4)
> java-6648 2D..1 6296us : try_to_wake_up()<-wake_up_process()<-wakeup_next_waiter()<-rt_mutex_slowunlock()
> java-6648 2D..1 6296us : rt_mutex_unlock()<-rt_up_read()<-do_futex()<-(-1)()
> java-6648 2D..2 6297us : effective_prio <<...>-6673> (-4 -4)
> java-6648 2D..2 6297us : __activate_task <<...>-6673> (-4 1)
> java-6648 2.... 6297us < (-11)
> java-6648 2.... 6298us+> sys_futex (0000000000afaf50 0000000000000001 0000000000000001)
> java-6648 2...1 6315us : __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
> java-6648 2...1 6315us : __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
> java-6648 2D..2 6316us+: deactivate_task <java-6648> (-4 1)
> <idle>-0 2D..2 6318us+: thread_return <java-6648> (-4 20)
> <idle>-0 2DN.1 6327us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()
> <idle>-0 2DN.1 6328us+: (0)()<-(0)()<-(0)()<-(0)()
> java-6629 2D..2 6330us+: thread_return <<idle>-0> (20 -4)
> java-6629 2D..1 6347us : try_to_wake_up()<-wake_up_process()<-wakeup_next_waiter()<-rt_mutex_slowunlock()
> java-6629 2D..1 6347us : rt_mutex_unlock()<-rt_up_read()<-futex_wake()<-(-1)()
> java-6629 2D..2 6348us : effective_prio <java-6235> (-4 -4)
> java-6629 2D..2 6349us : __activate_task <java-6235> (-4 1)
> java-6629 2.... 6350us+< (0)
> java-6629 2.... 6352us+> sys_futex (0000000000afc1dc 0000000000000001 0000000000000001)
> java-6629 2...1 6368us : __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
> java-6629 2...1 6368us : __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
> java-6629 2D..2 6369us+: deactivate_task <java-6629> (-4 1)
> <idle>-0 2D..2 6404us!: thread_return <java-6629> (-4 20)
> <idle>-0 2DN.1 6584us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()

indeed - basically the mm semaphore is a common resource here. I
suspect you'll see somewhat better numbers by using idle=poll or
idle=mwait (or are you using those options already?).
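
(i.e. something like this on the kernel boot line - the image name and
root device here are placeholders:)

	kernel /boot/vmlinuz-2.6.19-rt14 ro root=<your-root> idle=poll

idle=poll keeps idle CPUs spinning instead of halting, which cuts the
wakeup-from-idle latency visible around the thread_return events in the
trace above.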

(could you send me the whole trace if you still have it? It would be
interesting to see a broader snippet from the life of individual java
threads.)

Ingo