Re: 2.6.18-rc3-g3b445eea BUG: warning at /usr/src/linux-git/kernel/cpu.c:51/unlock_cpu_hotplug()

From: Dave Jones
Date: Fri Aug 04 2006 - 22:08:43 EST


On Fri, Aug 04, 2006 at 08:31:42PM -0400, Dave Jones wrote:
> On Fri, Aug 04, 2006 at 06:24:00PM -0400, Dave Jones wrote:
> > > DaveJ, any ideas?
> >
> > So I'm at a loss to explain why this made this bug suddenly appear, but
> > I don't think this is anything new. Fixing it however...
> > I'm looking at it, but I don't see anything obvious yet.
>
> Ok, debugging this is getting ridiculous. I dug out a core duo,
> built a kernel and watched it spontaneously reboot before it even got to init.

Hmm, this went away after a ccache flush & a rebuild. Old crud left behind
from earlier debugging I guess.

So I added this diff..

diff --git a/kernel/cpu.c b/kernel/cpu.c
index f230f9a..31210a0 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -31,6 +31,14 @@ void lock_cpu_hotplug(void)
{
struct task_struct *tsk = current;

+ if (jiffies > HZ * 60) {
+ printk ("CPU%d called lock_cpu_hotplug() for app %s. recursive_depth=%d\n",
+ smp_processor_id(),
+ current->comm,
+ recursive_depth);
+ dump_stack();
+ }
+
if (tsk == recursive) {
static int warnings = 10;
if (warnings) {
@@ -42,18 +50,30 @@ void lock_cpu_hotplug(void)
return;
}
mutex_lock(&cpu_bitmask_lock);
+ if (jiffies > HZ * 60)
+ printk ("%s acquired cpu_bitmask_lock\n", current->comm);
recursive = tsk;
}
EXPORT_SYMBOL_GPL(lock_cpu_hotplug);

void unlock_cpu_hotplug(void)
{
+ if (jiffies > HZ * 120) {
+ printk ("CPU%d called unlock_cpu_hotplug() for app %s. recursive_depth=%d\n",
+ smp_processor_id(),
+ current->comm,
+ recursive_depth);
+ dump_stack();
+ }
+
WARN_ON(recursive != current);
if (recursive_depth) {
recursive_depth--;
return;
}
mutex_unlock(&cpu_bitmask_lock);
+ if (jiffies > HZ * 120)
+ printk ("%s released cpu_bitmask_lock\n", current->comm);
recursive = NULL;
}
EXPORT_SYMBOL_GPL(unlock_cpu_hotplug);

In the hope to better understand what the hell is going on.
Here's what happened when I started the cpuspeed service

(That DWARF unwinder stuck thing is really darned annoying btw)

CPU1 called lock_cpu_hotplug() for app modprobe. recursive_depth=0
[<c0104edc>] show_trace_log_lvl+0x58/0x152
[<c01054c2>] show_trace+0xd/0x10
[<c01055db>] dump_stack+0x19/0x1b
[<c013e8c3>] lock_cpu_hotplug+0x39/0xbf
[<c01499c8>] stop_machine_run+0x11/0x38
[<c0141943>] sys_init_module+0x166d/0x1847
[<c0103db9>] sysenter_past_esp+0x56/0x8d
DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d
Leftover inexact backtrace:
[<c01054c2>] show_trace+0xd/0x10
[<c01055db>] dump_stack+0x19/0x1b
[<c013e8c3>] lock_cpu_hotplug+0x39/0xbf
[<c01499c8>] stop_machine_run+0x11/0x38
[<c0141943>] sys_init_module+0x166d/0x1847
[<c0103db9>] sysenter_past_esp+0x56/0x8d
modprobe acquired cpu_bitmask_lock

CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0
[<c0104edc>] show_trace_log_lvl+0x58/0x152
[<c01054c2>] show_trace+0xd/0x10
[<c01055db>] dump_stack+0x19/0x1b
[<c013e8c3>] lock_cpu_hotplug+0x39/0xbf
[<c029fbae>] store_scaling_governor+0x142/0x1a3
[<c029f1a5>] store+0x37/0x48
[<c01a6561>] sysfs_write_file+0xab/0xd1
[<c016f99f>] vfs_write+0xab/0x157
[<c016ffe4>] sys_write+0x3b/0x60
[<c0103db9>] sysenter_past_esp+0x56/0x8d
DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d
Leftover inexact backtrace:
[<c01054c2>] show_trace+0xd/0x10
[<c01055db>] dump_stack+0x19/0x1b
[<c013e8c3>] lock_cpu_hotplug+0x39/0xbf
[<c029fbae>] store_scaling_governor+0x142/0x1a3
[<c029f1a5>] store+0x37/0x48
[<c01a6561>] sysfs_write_file+0xab/0xd1
[<c016f99f>] vfs_write+0xab/0x157
[<c016ffe4>] sys_write+0x3b/0x60
[<c0103db9>] sysenter_past_esp+0x56/0x8d
cpuspeed acquired cpu_bitmask_lock

I have questions.

1. How come the printk at the end of unlock_cpu_hotplug() never
got hit when modprobe got to the end of stop_machine_run() ?

2. How come another process acquired this mutex when the previous
one aparently never released it ?

Moving on ...

CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0
[<c0104edc>] show_trace_log_lvl+0x58/0x152
[<c01054c2>] show_trace+0xd/0x10
[<c01055db>] dump_stack+0x19/0x1b
[<c013e8c3>] lock_cpu_hotplug+0x39/0xbf
[<c0132f3c>] __create_workqueue+0x52/0x122
[<f8faf34b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand]
[<c029f7b6>] __cpufreq_governor+0x57/0xd8
[<c029f985>] __cpufreq_set_policy+0x14e/0x1bc
[<c029fbc5>] store_scaling_governor+0x159/0x1a3
[<c029f1a5>] store+0x37/0x48
[<c01a6561>] sysfs_write_file+0xab/0xd1
[<c016f99f>] vfs_write+0xab/0x157
[<c016ffe4>] sys_write+0x3b/0x60
[<c0103db9>] sysenter_past_esp+0x56/0x8d
DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d
Leftover inexact backtrace:
[<c01054c2>] show_trace+0xd/0x10
[<c01055db>] dump_stack+0x19/0x1b
[<c013e8c3>] lock_cpu_hotplug+0x39/0xbf
[<c0132f3c>] __create_workqueue+0x52/0x122
[<f8faf34b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand]
[<c029f7b6>] __cpufreq_governor+0x57/0xd8
[<c029f985>] __cpufreq_set_policy+0x14e/0x1bc
[<c029fbc5>] store_scaling_governor+0x159/0x1a3
[<c029f1a5>] store+0x37/0x48
[<c01a6561>] sysfs_write_file+0xab/0xd1
[<c016f99f>] vfs_write+0xab/0x157
[<c016ffe4>] sys_write+0x3b/0x60
[<c0103db9>] sysenter_past_esp+0x56/0x8d
Lukewarm IQ detected in hotplug locking
BUG: warning at kernel/cpu.c:46/lock_cpu_hotplug()
[<c0104edc>] show_trace_log_lvl+0x58/0x152
[<c01054c2>] show_trace+0xd/0x10
[<c01055db>] dump_stack+0x19/0x1b
[<c013e8fc>] lock_cpu_hotplug+0x72/0xbf
[<c0132f3c>] __create_workqueue+0x52/0x122
[<f8faf34b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand]
[<c029f7b6>] __cpufreq_governor+0x57/0xd8
[<c029f985>] __cpufreq_set_policy+0x14e/0x1bc
[<c029fbc5>] store_scaling_governor+0x159/0x1a3
[<c029f1a5>] store+0x37/0x48
[<c01a6561>] sysfs_write_file+0xab/0xd1
[<c016f99f>] vfs_write+0xab/0x157
[<c016ffe4>] sys_write+0x3b/0x60
[<c0103db9>] sysenter_past_esp+0x56/0x8d
DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d
Leftover inexact backtrace:
[<c01054c2>] show_trace+0xd/0x10
[<c01055db>] dump_stack+0x19/0x1b
[<c013e8fc>] lock_cpu_hotplug+0x72/0xbf
[<c0132f3c>] __create_workqueue+0x52/0x122
[<f8faf34b>] cpufreq_governor_dbs+0x9f/0x2c3 [cpufreq_ondemand]
[<c029f7b6>] __cpufreq_governor+0x57/0xd8
[<c029f985>] __cpufreq_set_policy+0x14e/0x1bc
[<c029fbc5>] store_scaling_governor+0x159/0x1a3
[<c029f1a5>] store+0x37/0x48
[<c01a6561>] sysfs_write_file+0xab/0xd1
[<c016f99f>] vfs_write+0xab/0x157
[<c016ffe4>] sys_write+0x3b/0x60
[<c0103db9>] sysenter_past_esp+0x56/0x8d


Ok, now that it's done insulting my IQ, recursive_depth should be 1.
However ...


CPU1 called lock_cpu_hotplug() for app cpuspeed. recursive_depth=0
[<c0104edc>] show_trace_log_lvl+0x58/0x152
[<c01054c2>] show_trace+0xd/0x10
[<c01055db>] dump_stack+0x19/0x1b
[<c013e8c3>] lock_cpu_hotplug+0x39/0xbf
[<c029fbae>] store_scaling_governor+0x142/0x1a3
[<c029f1a5>] store+0x37/0x48
[<c01a6561>] sysfs_write_file+0xab/0xd1
[<c016f99f>] vfs_write+0xab/0x157
[<c016ffe4>] sys_write+0x3b/0x60
[<c0103db9>] sysenter_past_esp+0x56/0x8d


WTF is going on here ?
In the whole log, recursive_depth is always 0.


At this stage, it gets into a death spiral printing out the same
crap over and over again. I put the full log at
http://people.redhat.com/davej/hotplug-wtf.txt

Note how the acquire/releases don't match up at all. Scary.

Dave

--
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/