Re: Monthly md check == hung machine; how do I debug?

From: Robin Lee Powell
Date: Tue Feb 05 2008 - 16:18:28 EST


On Wed, Feb 06, 2008 at 07:27:56AM +1100, Neil Brown wrote:
> On Tuesday February 5, rlpowell@xxxxxxxxxxxxxxxxxx wrote:
> >
> > I was able to solve the problem, however, like so:
> >
> > 132c133
> > < # CONFIG_PREEMPT_NONE is not set
> > ---
> > > CONFIG_PREEMPT_NONE=y
> > 134,135c135,136
> > < CONFIG_PREEMPT=y
> > < CONFIG_PREEMPT_BKL=y
> > ---
> > > # CONFIG_PREEMPT is not set
> > > # CONFIG_PREEMPT_BKL is not set
> >
>
> This suggests that there is some sort of race. Given that I've
> never hit it on SMP machines, it is probably a very small window
> that opens immediately after some event that triggers kernel
> preemption.
>
> The only "mdadm --monitor" does

Going to stop you right there; "mdadm --monitor" wasn't it, nor was
smartd, as I thought at one point. I honestly don't know what was
triggering it, except maybe disk access. The fact that backups were
running at the same time as the sync seemed to make it happen
faster; that's the best I've got at this point.

> What sort of hardware do you have? x86? SMP or uni-processor?
> Also, exactly what kernel are you running?

rlpowell@chain> uname -a
Linux chain.digitalkingdom.org 2.6.23.1-dk3 #4 SMP Mon Feb 4 06:14:44 PST 2008 x86_64 GNU/Linux
rlpowell@chain> cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 39
model name : AMD Athlon(tm) 64 Processor 3700+
stepping : 1
cpu MHz : 2210.251
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflu
t fxsr_opt lm 3dnowext 3dnow up rep_good pni lahf_lm
bogomips : 4422.66
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc


> I might see if I can reproduce it... so if you can send me the
> broken .config, that might help too.

http://teddyb.org/~rlpowell/media/regular/config-2.6.23.1-dk2.txt
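
For anyone comparing that config against a working one, here is a
minimal sketch (Python) that pulls the preemption options out of
.config text. The helper name preempt_settings and the BROKEN/WORKING
sample strings are made up for illustration; the sample lines are just
the ones from the diff quoted earlier in this message, not the full
config.

```python
import re

def preempt_settings(config_text):
    """Return {option: value} for CONFIG_PREEMPT* lines.

    Options written as '# CONFIG_FOO is not set' are reported as 'n'.
    (Hypothetical helper, not part of any kernel tooling.)
    """
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Enabled options look like CONFIG_PREEMPT=y
        m = re.match(r"^(CONFIG_PREEMPT\w*)=(.*)$", line)
        if m:
            settings[m.group(1)] = m.group(2)
            continue
        # Disabled options are kept as a comment line
        m = re.match(r"^# (CONFIG_PREEMPT\w*) is not set$", line)
        if m:
            settings[m.group(1)] = "n"
    return settings

# Sample lines taken from the two sides of the diff ('<' and '>').
BROKEN = """\
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
"""
WORKING = """\
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
"""

if __name__ == "__main__":
    for name, text in (("hanging", BROKEN), ("working", WORKING)):
        print(name, preempt_settings(text))
```

Feeding it the full text of two .config files and comparing the
returned dicts shows only the preemption-related differences, which is
all that changed between my hanging and working kernels.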

-Robin