Re: Btrfs: broken file system design (was Unbound(?) internal fragmentationin Btrfs)

From: Edward Shishkin
Date: Fri Jun 18 2010 - 15:30:53 EST


Jamie Lokier wrote:
Edward Shishkin wrote:
If you decide to base your file system on some algorithms then please
use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific magazines of
proper level.

Personally I don't see any way to improve the situation with Btrfs
except full redesigning the last one. If you want to base your file
system on the paper of Ohad Rodeh, then please, use *exactly* the
Bayer's B-trees that he refers to. That said, make sure that all
records you put to the tree has equal length and all non-root nodes of
your tree are at least half filled.

First, thanks Edward for identifying a specific problem with the
current btrfs implementation.

Hello Jamie.

I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.

Which property is robust?

Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records. It is not true; if Knuth said
so, Knuth was mistaken.

I didn't say about intrinsic algorithmic problems :)
I just repeat (after Knuth et al) that B-trees with variable-length records don't
have any sane boundary for internal fragmentation. The common idea is that if we
don't want Btrfs to be in infinite development stage, then we should choose some
*sane* strategy (for example the paper of Ohad Rodeh) and strictly adhere this in
future.

This is easily shown by modelling variable-length records (key ->
string) as a range of fixed length records ([key,index] -> byte) and
apply the standard B-tree algorithms to that, which guarantees
algorithm properties such as space utilisation and time; then you can
substitute a "compressed" representation of contiguous index runs,
which amounts to nothing more than just storing the strings (split
where the algorithm says to do so) and endpoint indexes , and because
this compression does not expand (in any way that matters), classic
algorithmic properties are still guaranteed.

Variable-length keys are a different business. Those are trickier,
but as far as I know, btrfs doesn't use them.

As to current Btrfs code: *NOT ACK*!!! I don't think we need such
"file systems".

Btrfs provides many useful features that other filesystems don't. We
definitely need it, or something like it. You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.

It is not deep design bug; it is just a result of the packing and
balancing heuristics.

Frankly, I would like to believe to such end, taking into account amount of my
contributions to the Btrfs project. At least to make sure I didn't do useless
work..

If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well,

Because of top-down balancing. It doesn't like "clever" things like "splitting"
and "merging". Currently top-down works properly only with stupid classic
Bayer's B-trees.

and then propose a fix.

I'll try to help, but I am rather pessimistic here: working out algorithms is
something, which doesn't like timelines..

Note that it's not necessarily a problem to have a few nodes with low
utilisation. Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]

Ok, let's violate this, but not in the detriment of utilisation:
Internal fragmentation is a horror for file systems, the enemy #1
(which is completely defeated in the last century BTW).

The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.

IMHO this is a problem, as we can not estimate number of such nodes.
Do you have any sane upper boundary for this number? I don't know such one.
Maybe I have missed something?

[*] For example when filling a tree by inserting contiguously
ascending keys, the classic "split into two when full" heuristic gives
the worst possible results (50% lost space), and deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space).

At the end of what? I hope you don't speak about on-line defragmentation?
Can you point me to the paper (if any)?

Thanks!

It's also faster. The trick is to make
sure there's just the right number of underfilled nodes...

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html


--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/