Re: 2.6.28.9: EXT3/NFS inodes corruption

From: Sylvain Rochet
Date: Fri Aug 21 2009 - 06:52:06 EST


Hi,


On Thu, Aug 20, 2009 at 05:00:35PM -0700, Simon Kirby wrote:
> On Thu, Aug 20, 2009 at 07:19:53PM +0200, Sylvain Rochet wrote:
>
> > So, everything is fine, but the problem happened only one time on this
> > server, so we cannot conclude anything after a few weeks. However,
> > I now have physical access back, so we will switch back to the former
> > server where the problem happened quite frequently, then we will see!
>
> Not to derail the thread, but you were definitely seeing the same issues
> with stock 2.6.30.4, right?

Nope, the last issue we had came from 2.6.28.9.

We upgraded to 2.6.30.3 on Jan's advice, then we "upgraded" to 2.6.30.3
with Jan's first patch, which adds some debug output
(0001-ext3-Debug-unlinking-of-inodes.patch). Finally, we upgraded to
2.6.30.4 with both Jan's first patch and his second one
(0001-fs-Make-sure-data-stored-into-inode-is-properly-see.patch), which
adds an smp_mb() to the unlock_new_inode() function.


> We had all sorts of corruption happening for files served via NFS with
> 2.6.28 and 2.6.29, but everything was magically fixed on 2.6.30
> (though we needed a lot of fscking). I never did track down what
> change fixed it, since it took a while to reproduce.

Same here, everything has been fine since 2.6.30. In a few days we will
switch back to the quad-core server where the corruption happened. We
are now using a dual-Opteron server because we suspected hardware issues
on the quad-core, but the corruption also happened once on the
dual-Opteron (which is IMHO sufficient evidence to rule out a hardware
issue). I guess the issue was (or is) kinda SMP-related.

And yep, we also spent a long time playing with fsck ;-) Luckily, the
corruption only occurs on new files, and new files are mostly caches,
sessions, logs, and such, so fsck used its chainsaw on files that were
not really important.


> Hmm. I just noticed what seems to be a new occurrence of "deleted inode
> referenced" on a box with 2.6.30. We saw many when we first upgraded to
> 2.6.30 due to the corruption caused by 2.6.29, but those all occurred
> within a day or so and were fsck'd. I would have thought the backup
> sweeps would have tripped over that inode way before now...
>
> Just wondering if you can confirm that the errors you saw with 2.6.30.4
> were not leftover from older kernels.

The few corrupted inodes from 2.6.28.9 (and earlier) were moved to
lost+found to prevent them from being reused. We ran an fsck when we
moved to 2.6.30.4, and it fixed everything. We have not had any
corruption with 2.6.30.4 so far.


Sylvain
