Re: Volume management on Linux with the ext2fs.

Theodore Y. Ts'o (tytso@MIT.EDU)
Wed, 23 Apr 1997 10:23:14 -0400

From: Miguel de Icaza <miguel@nuclecu.unam.mx>
Date: 22 Apr 1997 14:43:46 -0500

I have been considering adding to Linux volume management
capabilities, in the spirit of the IRIX volume management, where it is
possible for a file system to span multiple devices....

OK, let me make a few observations.

First of all, there are a number of places in the kernel which assume
that if a filesystem is associated with a device, it's associated with
exactly one device. Filesystems obviously don't have to be associated
with a device --- for example, NFS and SMBfs aren't associated with
devices --- however you have to write the filesystem code differently.
And the ext2 filesystem code, at a very fundamental level, assumes an
ext2 filesystem is associated with a single device.

That being said, what's wrong with using RAID-0? Well, if you haven't
made your filesystem be RAID-0 from the beginning, it's harder to turn
it into a RAID-0 configuration. However, if you do have a pre-existing
RAID-0 configuration, it's not all that hard to add an additional disk
to it. Once you add an additional disk to a RAID-0 configuration, the
trick is refrobbing the ext2 filesystem to know about the extra block
groups. I am in the process of writing a program that will allow you to
resize ext2 partitions, which will take care of this case for you quite
nicely.
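
To make the block-group bookkeeping concrete, here is a minimal sketch
of the arithmetic a resizer has to do when the volume grows. It is not
the resize program itself. The function names are hypothetical, and
the defaults (8192 blocks per group, first data block 1) assume an
ext2 filesystem with 1 KiB blocks.

```python
def group_count(blocks_count, blocks_per_group=8192, first_data_block=1):
    """Block groups needed to cover the data area (ceiling division)."""
    data_blocks = blocks_count - first_data_block
    return (data_blocks + blocks_per_group - 1) // blocks_per_group


def groups_to_add(old_blocks, new_blocks, blocks_per_group=8192,
                  first_data_block=1):
    """How many extra block groups the grown filesystem must describe."""
    old = group_count(old_blocks, blocks_per_group, first_data_block)
    new = group_count(new_blocks, blocks_per_group, first_data_block)
    return new - old
```

Each added group also needs its own bitmaps, inode table and group
descriptor, which this sketch does not model.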

The advantage of using RAID-0, then, is that you don't have to change
any kernel code --- it's merely writing a user-space program to resize
the filesystem. If we used your scheme, we would have to rewrite an
awful lot of kernel code, and then make some pretty large modifications
to the user-space programs as well. So I think the RAID-0 technique is
better.

There remains the problem of how to turn an existing non-RAID-0
filesystem into a RAID-0 device. This is solvable;
all you have to do is to resize the existing filesystem down by a few
blocks (to make room for the RAID-0 header). Then you simply need to
write a program which moves the entire filesystem down by the size of
the RAID-0 header, and then add the RAID-0 header. Again, this is quite
doable, and only requires user-space programming.
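
The move step could look roughly like the sketch below. It assumes
that "down" means toward higher offsets, that the filesystem has
already been shrunk so the last header_size bytes of the device are
unused, and it models the device as an in-memory bytearray. The name
shift_down and the chunk size are hypothetical. Copying from the back
toward the front means no chunk is overwritten before it has been
moved.

```python
def shift_down(image, header_size, chunk=4096):
    """Shift image[0:len-header_size] up by header_size bytes, in place."""
    payload = len(image) - header_size
    if payload < 0:
        raise ValueError("header is larger than the device")
    end = payload
    while end > 0:
        start = max(0, end - chunk)
        # The right-hand slice is copied before assignment, so an
        # overlap between source and destination inside one chunk is safe.
        image[start + header_size:end + header_size] = image[start:end]
        end = start
    # Clear the space at the front; a real tool would write the header here.
    image[:header_size] = bytes(header_size)
```

A real tool would work on the block device and would have to survive
being interrupted halfway, which this sketch ignores.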

The one problem with both your scheme and this RAID-0 scheme is: what
do you do when you want to remove a physical disk from a filesystem?
The simplest case is where you want to remove the last physical disk.
Then, you will need the resize utility (which you will already have in
order to accomplish the RAID-0 scheme), so you can resize the filesystem down.
The resizing utility will have to migrate inodes and disk blocks out
from the physical disk, so that you will be able to remove it from the
logical volume.

The really hard case happens when you want to remove a disk from the
*middle* of a logical volume. This is where things get really painful.
It can be done with the resizing utility, at least in theory, but the
resizing utility would need to not only migrate blocks and inodes away
from the physical disk, but it would also need to renumber all of the
blocks and inode numbers for all disks *after* the physical disk, and
update inodes and directories accordingly.
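
The block renumbering can be sketched as a simple mapping. This
assumes the disks are laid end to end (concatenated rather than
striped), with sizes given in blocks. The name renumber_block is
hypothetical, and inode numbers would need an analogous mapping that
is not shown.

```python
def renumber_block(block, disk_sizes, removed):
    """Map an old logical block number to its number after disk
    `removed` (an index into disk_sizes) is taken out of the volume."""
    start = sum(disk_sizes[:removed])      # first block on the removed disk
    end = start + disk_sizes[removed]      # first block on the next disk
    if block < start:
        return block                       # earlier disks are unaffected
    if block >= end:
        return block - disk_sizes[removed] # later disks slide down
    raise ValueError("block is on the disk being removed; migrate it first")
```

Every block pointer in every inode and indirect block that refers to
a disk after the removed one has to be passed through a mapping like
this, which is what makes the middle case so much more work than
removing the last disk.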

This is where a redesigned scheme that tries to handle volume sets from
the very beginning has some advantages over a RAID-0 scheme. However,
I suspect it's still less code to make the resize utility handle
removing a disk from the middle, when compared against the cost of
implementing a whole new advanced filesystem from scratch.

-----------------------------------------------------

As far as adding extra features such as logging to ext2fs, it's
something that Stephen and I have at least talked about, and I've done
some initial thinking about its design. The hardest part is that we
don't really have the underlying support for it in the Linux
buffer/page cache layer. What is really needed is multiple buffer cache
queues for each device, representing the different priority levels at which
buffers should be flushed out to disk. So if there are blocks on the
high-priority queue, they should be processed before any blocks on the
medium priority queue, and so on. There probably should also be a
special super-low priority queue where things like atime updates would
go, where this queue would only get flushed either when (a) you need to
reclaim the buffer to get back some memory, or (b) when you unmount the
filesystem. This would make a really big difference as far as
performance on busy filesystems and prolonging the life of batteries in
a laptop computer, since atime updates would rarely get flushed to disk.
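
A toy model of this queueing policy, purely to illustrate the
ordering rules, might look like the following. It is a user-space
sketch, not kernel code; the class name, method names and level names
are all hypothetical.

```python
from collections import deque


class FlushQueues:
    """Per-device dirty-buffer queues with strict priority ordering."""

    def __init__(self, levels=("high", "medium", "low")):
        self.levels = levels
        self.queues = {level: deque() for level in levels}
        self.lazy = deque()  # e.g. atime updates

    def mark_dirty(self, block, level):
        self.queues[level].append(block)

    def mark_dirty_lazy(self, block):
        self.lazy.append(block)

    def next_to_flush(self):
        """Next block for the periodic flusher; never touches the lazy queue."""
        for level in self.levels:
            if self.queues[level]:
                return self.queues[level].popleft()
        return None

    def drain_lazy(self):
        """Called only on memory reclaim or unmount."""
        blocks = list(self.lazy)
        self.lazy.clear()
        return blocks
```

The point of keeping the lazy queue out of next_to_flush() is that an
idle disk stays idle: atime-only changes never cause a write by
themselves.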

Once you have the multiple levels of priority for the write queues, it
makes life much easier for doing an efficient rollback log for ext2.
The basic idea is that a rollback log allows you to undo meta-data updates
to get your disk back to a stable state. Periodically, the system will
do a commit operation where all pending meta-data is flushed out. This
needs to be done periodically because (a) you want files that are
written to be actually committed, and (b) disk blocks which are freed
due to a file deletion or truncation can't be re-used until the commit
happens.
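
Here is a minimal sketch of the rollback idea, assuming the disk is
modelled as a dict from block number to contents. The class and method
names are hypothetical and nothing here reflects an actual ext2
design. It shows why freed blocks have to wait for a commit: until
then, a rollback may bring back the file that owned them, so their
contents must still be intact.

```python
class RollbackLog:
    def __init__(self, disk):
        self.disk = disk            # block number -> bytes
        self.undo = []              # (block, old contents) in update order
        self.pending_free = set()   # freed, but not reusable yet
        self.free = set()           # safe to hand out again

    def update(self, block, new_contents):
        """Record the old contents, then apply a meta-data update."""
        self.undo.append((block, self.disk.get(block)))
        self.disk[block] = new_contents

    def release(self, block):
        """A deletion or truncation freed this block."""
        self.pending_free.add(block)

    def commit(self):
        """Pending state is now the stable state; forget how to undo it."""
        self.undo.clear()
        self.free |= self.pending_free
        self.pending_free.clear()

    def rollback(self):
        """Undo updates in reverse order, back to the last commit."""
        while self.undo:
            block, old = self.undo.pop()
            if old is None:
                self.disk.pop(block, None)
            else:
                self.disk[block] = old
        self.pending_free.clear()
```

A real log would of course live on disk and be ordered against the
buffer writes, which is where the priority queues come in.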

There are of course other logging schemes which can be used, such as
those used in the lfs and other more advanced filesystems. It is often
very, very hard to get the performance of those advanced filesystems up
to that of a well-tuned traditional filesystem, especially when the
amount of free memory that you can assume is available for buffer
caches varies widely.

As far as hashing or B-tree directory structures, and extent-based
filesystems, those are things which are possible in the ext2 filesystem
as well. There is an obvious design question about whether it's simpler
to build these sorts of things in from scratch, or take an existing
filesystem and add these features.

My opinion is that for simple things like adding hashed or B-tree
directories, or using an extent-based block encoding scheme, we're
better off simply enhancing the ext2 filesystem. Obviously, for the really
esoteric features such as log-structured filesystems and generalized
volume management (beyond just RAID-oriented schemes), you might as well
start from scratch. However, there's a cost-benefit tradeoff to these
schemes, and I think we can add a lot of really good stuff to the ext2
filesystem, while maintaining backwards compatibility.

If other people wish to work on a really advanced filesystem --- I wish
them luck. It's not an easy thing to do, though, especially in terms of
getting the reliability and efficiency up to acceptable levels. I for
one am very interested in seeing what they come up with.

- Ted
