Re: linux-2.1.44 on i586: Immediate crash on boot

Linus Torvalds (torvalds@transmeta.com)
Wed, 9 Jul 1997 08:43:30 -0700 (PDT)


On Wed, 9 Jul 1997, Bill Hawes wrote:
>
> You left some massive race conditions in clear_inode and iput though --
> see my recent patches against 2.0.30 for example. The attached patch
> takes care of these.
>
> The question of whether clear_inode can be called with count == 0 needs
> to be resolved. As it presently stands a filesystem without a put
> function will allow the inode count to go to 0 while remaining on the
> inuse list. This is OK, except that when you want to reuse it you need
> to call clear_inode, which reintroduces race conditions ...

Actually, that's not true with the new inode code. If you look carefully,
you'll notice that "clear_inode()" is never called just because i_count is
zero.

clear_inode() is called _only_ by filesystems that want to indicate that
the inode no longer exists (ie when the filesystem actually deletes the
inode). At that point i_count is still 1, actually, as this is done from
the "put()" routine before i_count has been decremented.

The other way the inode can migrate to the unused list is through
"invalidate_inodes()" (seldom) or "try_to_free_inodes()", which just take
a look at the in_use list and move any CAN_UNUSE() inodes into the unused
list. This operation is atomic, because CAN_UNUSE() essentially makes sure
there is _nothing_ the inode needs to care about (no nrpages, no count, no
dirty etc).
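
A minimal user-space sketch of the sweep described above, not the kernel
code itself. The struct, its fields beyond i_count, and every sketch_ name
are assumptions made for illustration; the real CAN_UNUSE() may test more
than these three conditions, and the real lists are not singly linked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified inode: only the state the text mentions. */
struct sketch_inode {
    int i_count;                /* active references */
    int i_nrpages;              /* cached pages still attached */
    int i_dirty;                /* nonzero if it still needs writing back */
    struct sketch_inode *next;  /* singly linked list, for brevity */
};

/* Same idea as CAN_UNUSE(): nothing is left that the inode cares about. */
static int sketch_can_unuse(const struct sketch_inode *inode)
{
    return inode->i_count == 0 && inode->i_nrpages == 0 && !inode->i_dirty;
}

/*
 * Walk the in-use list and move every reclaimable inode onto the unused
 * list. Returns the number of inodes moved.
 */
static int sketch_free_inodes(struct sketch_inode **in_use,
                              struct sketch_inode **unused)
{
    struct sketch_inode **link = in_use;
    int moved = 0;

    assert(in_use != NULL && unused != NULL);

    while (*link != NULL) {
        struct sketch_inode *inode = *link;

        if (sketch_can_unuse(inode)) {
            *link = inode->next;    /* unlink from the in-use list */
            inode->next = *unused;  /* push onto the unused list */
            *unused = inode;
            moved++;
        } else {
            link = &inode->next;    /* still busy: leave it alone */
        }
    }
    return moved;
}
```

Because the predicate only passes when there is no count, no pages and no
dirty state, moving the inode needs no further bookkeeping, which is what
makes the move safe to treat as a single step.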

There is one known race: after iput() has done the "put()" operation, it
will do a decrement on the inode count. That is generally fine: if the
"put()" has not put the inode on the free list then the decrement is the
right thing to do, and if it _has_ put it on the free list the decrement
doesn't make any difference because i_count is no longer used.

However, there is a slight race where the filesystem puts the inode on the
free list and then that inode is immediately allocated by something else,
and now the i_count decrement is done on somebody else's inode.

This race cannot actually currently happen because filesystems do not
sleep between the clear_inode() and returning to iput() and the race is
protected by the global kernel lock (SMP) or general rescheduling rules
(UP). However, I designed the new inode code to be SMP safe even without
the current global lock, so this is a "bug" in my books and I'll have to
fix it.
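
The ordering problem above can be modelled in a few lines of user-space C.
This is only a single-threaded simulation under stated assumptions: the
struct, the on_free_list flag and all sketch_ names are hypothetical, and
the "somebody else allocates it" step is called inline to stand in for
another CPU getting there first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified inode. */
struct sketch_inode {
    int i_count;        /* active references */
    int on_free_list;   /* nonzero once it has been given back */
};

/* Hypothetical type for the filesystem's put() hook. */
typedef void (*sketch_put_fn)(struct sketch_inode *);

/* Stand-in for clear_inode(): hand the inode back to the free list. */
static void sketch_clear_inode(struct sketch_inode *inode)
{
    inode->on_free_list = 1;
}

/* Stand-in for something else picking the inode off the free list. */
static void sketch_reallocate(struct sketch_inode *inode)
{
    assert(inode->on_free_list);
    inode->on_free_list = 0;
    inode->i_count = 1;     /* the new owner's single reference */
}

/* put() that deletes the inode and returns promptly: the benign case. */
static void sketch_put_delete(struct sketch_inode *inode)
{
    sketch_clear_inode(inode);
}

/* put() after which the inode is reused before iput() resumes. */
static void sketch_put_delete_then_reused(struct sketch_inode *inode)
{
    sketch_clear_inode(inode);
    sketch_reallocate(inode);   /* models another CPU winning the window */
}

/* The ordering described in the text: put() first, then the decrement. */
static void sketch_iput(struct sketch_inode *inode, sketch_put_fn put)
{
    if (put != NULL)
        put(inode);
    inode->i_count--;   /* harmless if freed, wrong if already reused */
}
```

Calling sketch_iput() with sketch_put_delete_then_reused leaves i_count at
0 even though the new owner still holds a reference, which is the
decrement landing on somebody else's inode.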

I'll certainly take a look at your patches, though. Maybe I just
overlooked something.

> One minor nit -- you init the inode semaphore in init_once, but then
> again in clean_inode. Once should be enough?

There is a lot of initialization I want to remove from clean_inode(). The
current code tries to be safe rather than clever, and I worried that maybe
somebody does a free on the inode while holding the semaphore (because the
old code used to re-initialize it). But you're probably right that it
should just be deleted. Anybody care to try and send me results?

Oh, final comments, because I really _should_ have warned people about the
problems that I knew about when I released it (most of these problems are
in 2.1.44 too):

- unmounting of filesystems does not work. The current inode.c doesn't
try to find out whether we can unmount or not, so it takes the "safe"
approach and tells the rest of the kernel that it can never unmount or
re-mount read-only. I need to go through the inodes and check that none
of them are in use..

- The dcache never frees any dcache entries it has allocated. NEVER. And
it frees the inodes that those dcache entries are associated with only
if the file is actually deleted (but even in that case the dcache entry
itself is not actually freed).

Problem #2 is actually the major reason for #1: because of #2 we have a
lot of inodes that look like they are in use, even though they are really
only in the directory cache, and as such it's currently impossible to tell
whether a filesystem is unused (it _would_ be possible to check whether we
can remount read-only, but I decided to punt on that too until #2 is
done).
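
A rough sketch of the kind of check problem #1 is missing, assuming a flat
array of inodes and a per-inode device number. Both are simplifications,
and sketch_may_umount and the struct fields are hypothetical names rather
than anything in inode.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified inode. */
struct sketch_inode {
    int i_dev;      /* device the inode lives on */
    int i_count;    /* active references */
};

/* Returns 1 if no inode on `dev` is referenced, 0 if the device is busy. */
static int sketch_may_umount(const struct sketch_inode *inodes, size_t n,
                             int dev)
{
    assert(inodes != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (inodes[i].i_dev == dev && inodes[i].i_count > 0)
            return 0;   /* somebody still holds this inode */
    }
    return 1;
}
```

A check like this only helps once problem #2 is fixed: while the dcache
keeps a reference on every inode it has seen, those inodes all have a
nonzero count and the function would report the device as busy every time.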

Problem #2 also means that 2.1.44 and pre-2.1.45 can be totally useless depending
on how much memory you have and what kinds of filesystem access patterns
you have. For example, for the kind of work I mostly do, #2 is usually not
a problem, because my access patterns are so regular that even though the
dcache never shrinks, it also tends to not grow very much.

Problem #1 results in all shutdowns being dirty shutdowns, so the
filesystems will be checked at the next boot. This has actually been a
"feature" for me personally because that way I can be sure that the
filesystems aren't silently and slowly being corrupted (although plain
2.1.44 can do that too - I've never seen it in pre-45, though ;)

Even though the above two problems are major showstoppers for most
"serious" use of pre-2.1.45, I'd still like people to test it out just to
get a feel for any other problems that might be lurking. Fixing the above
two problems isn't really hard, but #2 in particular will require quite
a bit of attention to detail. I want to know whether the kernel otherwise is
reasonably stable apart from these issues..

Linus