[BUG] spinlock lockup on CPU#0

From: Fabio Coatti
Date: Mon Mar 30 2009 - 13:36:23 EST


Hi all, I've got the following BUG: report on one of our servers running
2.6.28.8. Some background:
We are seeing several lockups on db (MySQL) servers that show up as a sudden
load increase, after which the server very quickly freezes. They happen at
random, sometimes after weeks, sometimes very soon after a system reboot.
Trying to track down the problem, we installed the latest (at the time of the
test) 2.6.28.X kernel and loaded it with heavy disk I/O operations (find,
dd, rsync and so on).
We were able to crash a server with these tests; unfortunately we could
capture only a remote screen snapshot, so I copied the data by hand
(hopefully without typos). The result is the following:

[<ffffffff80213590>] ? default_idle+0x30/0x50
[<ffffffff8021358e>] ? default_idle+0x2e/0x50
[<ffffffff80213793>] ? c1e_idle+0x73/0x120
[<ffffffff80259f11>] ? atomic_notifier_call_chain+0x11/0x20
[<ffffffff8020a31f>] ? cpu_idle+0x3f/0x70
BUG: spinlock lockup on CPU#0, find/13114, ffff8801363d2c80
Pid: 13114, comm: find Tainted: G D W 2.6.28.8 #5
Call Trace:
[<ffffffff8041a02e>] _raw_spin_lock+0x14e/0x180
[<ffffffff8060b691>] _spin_lock+0x51/0x70
[<ffffffff80231ca4>] ? task_rq_lock+0x54/0xa0
[<ffffffff80231ca4>] task_rq_lock+0x54/0xa0
[<ffffffff80234501>] try_to_wake_up+0x91/0x280
[<ffffffff80234720>] wake_up_process+0x10/0x20
[<ffffffff803bf863>] xfsbufd_wakeup+0x53/0x70
[<ffffffff802871e0>] shrink_slab+0x90/0x180
[<ffffffff80287526>] try_to_free_pages+0x256/0x3a0
[<ffffffff80285280>] ? isolate_pages_global+0x0/0x280
[<ffffffff80281166>] __alloc_pages_internal+0x1b6/0x460
[<ffffffff802a186d>] alloc_page_vma+0x6d/0x110
[<ffffffff8028d3ab>] handle_mm_fault+0x4ab/0x790
[<ffffffff80225293>] do_page_fault+0x463/0x870
[<ffffffff8060b199>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[<ffffffff8060bf52>] error_exit+0x0/0xa9

The machine is a dual AMD 2216HE (2 cores each) with 4 GB RAM; below you can
find the .config file (from /proc/config.gz).

We have been seeing similar lockups (similar at least in their results) for
several kernel revisions (starting from 2.6.25.X) and on different hardware.
Several machines are hit by this, mostly database servers (maybe because of
their specific usage, the other machines being Apache servers; I don't know).

Could someone give us some hints about this issue, or at least some
suggestions on how to dig into it? Of course we can do any sort of testing
and experiments.

Thanks for any answer.



Attachment: config_bug.gz
Description: GNU Zip compressed data