Re: MD/RAID time out writing superblock

From: Mark Lord
Date: Thu Sep 17 2009 - 09:35:28 EST


Tejun Heo wrote:
Hello,

Chris Webb wrote:
Hi Tejun. Thanks for following up to this. We've done some more
experimentation over the last couple of days based on your
suggestions and thoughts.

Tejun Heo <tj@xxxxxxxxxx> writes:
Seriously, it's most likely a hardware malfunction although I can't tell
where the problem is with the given data. Get the hardware fixed.
We know this isn't caused by a single faulty piece of hardware,
because we have a cluster of identical machines and all have shown
this behaviour. This doesn't mean that there isn't a hardware
problem, but if there is one, it's a design problem or firmware bug
affecting all of our hosts.

If it's multiple machines, it's much less likely to be faulty drives,
but if the machines are configured mostly identically, hardware
problems can't be ruled out either.

There have also been a few reports of problems which look very
similar in this thread from people with somewhat different hardware
and drives to ours.

I wouldn't connect the reported cases too eagerly at this point. Too
many different causes end up showing similar symptoms especially with
timeouts.

The above are IDENTIFY commands. Who's issuing IDENTIFY regularly? It
isn't from the regular IO paths or md. It's probably being issued via
SG_IO from userland. These failures don't affect normal operation.
[...]
Oooh, another possibility is the continuous IDENTIFY attempts above.
Doing things like that generally isn't a good idea because vendors
don't expect IDENTIFY to be mixed regularly with normal IOs and
firmware isn't tested against that. Even SMART commands sometimes
cause problems. So, finding out the thing which is obsessed with the
identity of the drive and stopping it might help.
We tracked this down to some (excessively frequent!) monitoring we
were doing using smartctl. Things were improved considerably by
stopping smartd and disabling all callers of smartctl, although it
doesn't appear to have been a cure. The frequency of these timeouts
during resync seems to have gone from about once every two hours to
about once a day, which means we've been able to complete some
resyncs whereas we were unable to before.
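
One way to keep SMART monitoring without polling "excessively" is to put a
floor on how often smartctl can run per drive. The following is only a
minimal sketch of that idea, not what was actually deployed here: the class
name ThrottledSmartCheck, the six-hour interval and the injectable clock and
runner parameters are all assumptions made for illustration. The only
smartctl option used is -H (overall health check).

```python
import subprocess
import time

# Assumed floor between checks; the thread does not give a recommended value.
MIN_INTERVAL_S = 6 * 3600


class ThrottledSmartCheck:
    """Run `smartctl -H <device>` at most once per `min_interval` seconds."""

    def __init__(self, device, min_interval=MIN_INTERVAL_S,
                 clock=time.monotonic, runner=subprocess.run):
        self.device = device
        self.min_interval = min_interval
        self.clock = clock      # injectable so the throttle can be tested
        self.runner = runner    # injectable so tests need not touch a drive
        self._last = None       # time of the last check that actually ran

    def poll(self):
        """Return the runner's result, or None if the call was suppressed."""
        now = self.clock()
        if self._last is not None and now - self._last < self.min_interval:
            return None  # too soon: do not send another command to the drive
        self._last = now
        return self.runner(["smartctl", "-H", self.device],
                           capture_output=True, text=True, check=False)
```

A monitoring loop could call poll() as often as it likes; only calls spaced
at least min_interval apart would reach the drive.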

That's interesting. One important side effect of issuing IDENTIFY
commands is that they serialize the command stream, as they are not
NCQ commands, and thus could change command patterns significantly.
..

SMART is the opcode that is most frequently implicated here, not IDENTIFY.
Note that even a barrier FLUSH CACHE is non-NCQ and will serialize the stream.
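
To illustrate what "serialize the stream" means, here is a toy model (a
sketch only, not libata code): NCQ commands may be outstanding together up to
the queue depth, while any non-NCQ command has to wait for the queue to drain
and then runs alone. The names Cmd and dispatch_batches and the default depth
of 32 are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class Cmd:
    name: str
    ncq: bool  # True for queued reads/writes, False for e.g. IDENTIFY, SMART, FLUSH CACHE


def dispatch_batches(cmds, queue_depth=32):
    """Group commands into batches that could be outstanding at the same time.

    Toy model: NCQ commands share the drive queue up to `queue_depth`;
    a non-NCQ command forces the queue to drain and is issued by itself.
    """
    batches, current = [], []
    for cmd in cmds:
        if cmd.ncq:
            if len(current) == queue_depth:
                batches.append(current)  # queue full: start a new batch
                current = []
            current.append(cmd.name)
        else:
            if current:
                batches.append(current)  # drain outstanding NCQ commands first
                current = []
            batches.append([cmd.name])   # non-NCQ command runs alone
    if current:
        batches.append(current)
    return batches
```

In this model, four queued writes with an IDENTIFY and a FLUSH CACHE mixed in
end up as five separate batches instead of one, which is the change in
command pattern being discussed.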

Cheers
