Re: PATCH to pre-patch-2.1.45: clean_inode needs to reset i_writecount

Linus Torvalds (torvalds@transmeta.com)
Thu, 10 Jul 1997 22:24:19 -0700 (PDT)


On Fri, 11 Jul 1997, Theodore Y. Ts'o wrote:
>
> You have to write out an ext2 inode after deleting it anyway, to
> decrement the link count to zero. Otherwise, e2fsck, when recovering
> the filesystem, will see an inode which is disconnected from the
> filesystem and assume it was caused by a directory getting smashed, and
> restore the inode to /lost+found.

I think e2fsck should at least look at the inode bitmap too: if it can't
find any reference to the inode in the directory tree, and the bitmap
indicates that the inode isn't in use, then the case is pretty clear,
imnsho.

So in fact the only time when i_nlink makes any difference would be when
the inode bitmap was corrupted _and_ the directory structure was
corrupted, at which point I don't see all that much reason to trust
i_nlink either..

> Now, you could use other methods for determining whether or not an inode
> is in use --- for example, you could use the inode bitmap. However,
> this increases the requirement that the inode bitmap field be correct,
> which it might not be. If an inode bitmap block gets zero'ed out, huge
> numbers of files would simply disappear.

Only if the directory tree also disappeared.

> Inode is in use IFF (inode is marked in use in the inode bitmap)
>                  OR (inode is referenced in the directory
>                      subtree directly reachable from the
>                      root directory)

Right.
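
Spelled out as code, it is just a two-bit OR per inode (a sketch with
illustrative names, not the real e2fsck data structures; ext2 inode
numbers start at 1, so bit 0 of a bitmap describes inode 1):

    /* Sketch of the combined test; names are illustrative, not the
     * real e2fsck structures.  Ext2 inode numbers start at 1, so bit 0
     * of a bitmap describes inode 1. */
    int inode_in_use(unsigned long ino,
                     const unsigned char *inode_bitmap,
                     const unsigned char *reachable_bitmap)
    {
        int marked    = (inode_bitmap[(ino - 1) / 8]     >> ((ino - 1) % 8)) & 1;
        int reachable = (reachable_bitmap[(ino - 1) / 8] >> ((ino - 1) % 8)) & 1;

        return marked || reachable;          /* in use IFF either says so */
    }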

> This still might cause you to lose precious data in an unclean shutdown
> if a directory gets corrupted (so that some files get disconnected from
> the root) AND the inode bitmap is somehow corrupted or not written out
> to disk before the crash.

Sure, you can lose data, but fsck was never meant to be a backup tool.
Fsck should do the best it can, but a filesystem that is designed for fsck
is a filesystem that is designed to fail.

> Worse still, this heuristic makes a complete hash of e2fsck's current
> algorithms, which work by assuming it can determine whether or not an
> inode is in use (and therefore what blocks are in use) without reference
> to the directory hierarchy. Making this change would require a major
> rewrite of e2fsck, and all existing e2fscks would not be able to handle
> ext2 filesystems which had been written to by kernels that were modified
> to have the behaviour you proposed.

The basic heuristic would still be valid: find out which inodes are in use
by looking at the inode bitmap, and start off with that. The algorithm is
exactly the same as testing "i_nlink > 0".

In fact, the semantic content of the inode bitmap and the test "i_nlink >
0" is _exactly_ the same, so this part of fsck doesn't change at all
(except it is sped up a lot when you have lots of free inodes, because
you don't even have to read those inodes in if they aren't marked in the
bitmap).
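
A minimal sketch of that first pass, assuming stand-in helpers
test_bit(), read_inode() and mark_blocks_in_use() rather than the real
e2fsck code; the point is only that unmarked inodes never cause a disk
read:

    /* Pass 1, sketched: only inodes marked in the bitmap are ever read
     * from disk.  test_bit(), read_inode() and mark_blocks_in_use()
     * are stand-ins, not the real e2fsck helpers. */
    void pass1(int fd, const unsigned char *inode_bitmap,
               unsigned long inode_count)
    {
        struct ext2_inode inode;
        unsigned long ino;

        for (ino = 1; ino <= inode_count; ino++) {
            if (!test_bit(ino - 1, inode_bitmap))
                continue;                    /* free inode: no disk read */
            read_inode(fd, ino, &inode);
            mark_blocks_in_use(&inode);      /* account for its blocks */
        }
    }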

Then you do the directory tree walk, same as before. The normal case is
that this will match with what you got in the earlier stage, and
everything is fine.

However, when you _do_ find an inode that doesn't match with the directory
tree, that's when things get interesting. Either you find an inode that
you haven't seen before, in which case you do roughly:

 - read the inode.
 - check the inode internal state for consistency. For example:
   - check the data blocks it says it has: are they marked in use in the
     block bitmap? If the inode is marked unused in the inode bitmap, and
     the data blocks it claims to have are _also_ marked unused (or
     duplicate from an inode that we've seen) then we can assume that this
     inode really was deleted, and it's just the directory entry that is
     obsolete. Fix: clear the directory entry.
 - if the internal inode state checks out, it's safe to add it back to
   the filesystem.

or you find an inode that was marked in use, but you never found it in the
directory tree, in which case you again check the internal consistency of
the inode and if everything checks out you add it to lost+found, otherwise
you just clear the in-use bit.
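
Roughly, in C (every helper named here is hypothetical, just labelling
the checks described above):

    /* Reconciliation sketch; every helper here is hypothetical. */
    void reconcile(int fd, unsigned long ino, int in_bitmap, int in_tree)
    {
        struct ext2_inode inode;

        if (in_bitmap == in_tree)
            return;                          /* the normal case: they agree */

        read_inode(fd, ino, &inode);

        if (in_tree && !in_bitmap) {
            /* A directory entry points at an inode the bitmap says
             * is free. */
            if (!blocks_look_allocated(&inode))
                clear_dir_entry(ino);        /* stale entry: the delete won */
            else if (inode_is_consistent(&inode))
                mark_inode_in_use(ino);      /* stale bitmap: resurrect it */
        } else {
            /* Marked in use, but unreachable from the root. */
            if (inode_is_consistent(&inode))
                link_into_lost_and_found(ino);
            else
                clear_inode_in_use(ino);
        }
    }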

> If we were willing to live with the consequences of using the above
> heuristic (instead of the much simpler and more robust strategy of
> depending on the inode link count field), then yes, we could do what you
> wanted --- although note that it was always two disk blocks being
> updated, since you always had to remove the directory entry. However,
> do file deletions happen often enough that it's really worth the cost of
> optimizing for them?

Yes. Our current "rm -rf xxx" really sucks performance-wise. I still
remember what it was like on the minix filesystem where it was pretty darn
instantaneous to delete a full directory tree: it takes _ages_ on ext2fs
compared to that. And don't tell me my directories tend to be larger these
days: that's certainly true, but my CPU and disks tend to be so much
faster that it should more than make up for it.

> P.S. A number of people have found the dtime field very useful for
> recovering deleted files after a mistaken "rm -rf" command.

Actually, because the minixfs didn't bother to write out the inodes after
it deleted them, it was _very_ simple to recover from "rm -rf". I've
actually done this once when I removed my linux source directory by
mistake:

 - find the inode that was the "linux" directory inode (not very hard:
   because the inode was never written out when it was deleted, all the
   information was still up-to-date)
 - create a new directory entry pointing to the "linux" inode (a sketch
   of this step follows the list)
 - run fsck
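
For illustration only: a minix directory entry is just a 16-bit inode
number followed by a 14-byte name, so "create a new directory entry"
boils down to a 16-byte write into a free slot of the directory's data
block. LINUX_INO and DIR_SLOT_OFFSET below are made-up values for the
sketch, not anything you can copy:

    /* Hypothetical sketch of step two.  A minix directory entry is a
     * 16-bit inode number plus a 14-byte name; LINUX_INO and
     * DIR_SLOT_OFFSET are made-up values, not something to reuse. */
    #include <stdio.h>
    #include <string.h>

    struct minix_dirent {
        unsigned short inode;                /* 16 bits on i386 */
        char           name[14];
    };

    #define LINUX_INO       4711             /* made up: the old "linux" inode */
    #define DIR_SLOT_OFFSET 0x8000L          /* made up: a free dirent slot    */

    int main(void)
    {
        struct minix_dirent de;
        FILE *dev = fopen("/dev/hda2", "r+b"); /* the unmounted filesystem */

        if (!dev)
            return 1;
        memset(&de, 0, sizeof(de));
        de.inode = LINUX_INO;
        strncpy(de.name, "linux", sizeof(de.name));
        fseek(dev, DIR_SLOT_OFFSET, SEEK_SET);
        fwrite(&de, sizeof(de), 1, dev);
        fclose(dev);
        return 0;
    }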

Voila, the _whole_ tree came back. NOTHING was lost. Because the inode
wasn't actually written out to disk when it was deleted, all the
information was still there: including the full directory tree (because
the data blocks were also marked clean when they were deleted, so all the
directory entry zeroing that had happened recursively never hit the disk
at all).

The only thing fsck needed to do was to mark the inodes (and the blocks in
the inodes) in use again when it noticed that they had a pointer to them
in the directory tree.

In contrast, when you write out the dtime, you also overwrite the old
inode information, making it much harder to recreate the file again. All
your data block pointers will be zero because the inode was truncated.

Now, I'm not claiming that the minix fsck was good: it was a piece of
rubbish compared to your fsck.ext2. It wasn't very clever at all: it
tended to give up way too easily (no lost+found), and in general it was
just a bad parody of what fsck should be like. But that doesn't mean that
the basic filesystem strategy was bad ;)

Final comment: I actually agree with you somewhat: using i_nlink is an
added piece of information, and I can see that you like to use it in fsck.
That doesn't change the fact that I _hate_ it that the normal run-time
operations are slowed down for a feature that _may_ under some
circumstances result in a better fsck. I'd much rather have a faster
filesystem.

I take the RISC approach: make the default case go fast, and make fsck a
bit slower and less reliable. I don't think anybody seriously thinks that
fsck is an alternative to backups anyway.

Linus