Re: Transparent compression in the FS

From: Jeff Garzik
Date: Thu Oct 16 2003 - 18:32:56 EST


jw schultz wrote:
> Because each hash algorithm has different pathologies,
> multiple hashes are generally better than one, but their
> effective aggregate bit count is less than the sum of the
> separate bit counts.

I was coming to this conclusion too... still, it's safer simply to handle collisions.
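It's worth quantifying "unlikely", though. A back-of-the-envelope birthday bound (my arithmetic, with illustrative figures: 4 KB blocks, a 128-bit digest) shows the probability is astronomically small under an idealized uniform hash, which is precisely the assumption the pathologies above undermine; hence comparing raw data anyway. A quick userspace sketch:

/*
 * Rough birthday bound: P(collision) ~= 1 - exp(-n^2 / 2^(b+1))
 * for n blocks under an ideal b-bit hash.  Figures are
 * illustrative only.  Build with: cc bday.c -lm
 */
#include <math.h>
#include <stdio.h>

static double collision_prob(double n, double bits)
{
	/* -expm1(-x) == 1 - exp(-x), accurate when x is tiny */
	return -expm1(-(n * n) / pow(2.0, bits + 1.0));
}

int main(void)
{
	/* 1 TB of 4 KB blocks = 2^28 blocks; 128-bit digest */
	printf("P(any collision) ~= %g\n",
	       collision_prob(pow(2.0, 28.0), 128.0));
	return 0;
}

That prints roughly 1e-22: tiny, but never zero, and real hashes are not ideal.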


> The idea of this sort of block-level hashing to allow
> sharing of identical blocks seems attractive, but I wouldn't
> trust any design that did not accept as given that there
> would be false positives. This means that a write would
> have to not only hash the block but then, if there is a
> collision, do a compare of the raw data. Then you have to

yep


> add the overhead of having lists of blocks that match a hash
> value and reference counts for each block itself. Further,
> every block write would then need to include, at minimum,
> relinking facilities on the hash lists and hash entry
> allocations, plus the inode block list being updated. Finally

Consider a simple key-value map, where "$hash $n" is the key and the value is the block of data. Only three operations need occur:
* if the hash already exists (highly unlikely!), read the existing block and compare it with the raw data
* write the new block to disk
* add a single datum to the key-value index

Inode block lists would need to be updated regardless of any collision; that is a standard, required part of any VFS interaction. And the internal workings of the key-value index (think Berkeley DB) are the same whether or not a collision occurs.
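To pin that down, here's a minimal C sketch of that write path. The kv_lookup()/kv_insert() calls and the block helpers are hypothetical stand-ins for whatever key-value store and allocator the FS would actually use; this illustrates the "$hash $n" keying, it is not a real patch:

/*
 * Sketch only.  kv_lookup()/kv_insert(), hash_block(),
 * read_block() and alloc_and_write_block() are assumed
 * helpers, not a real kernel or Berkeley DB API.
 */
#include <string.h>

#define BLOCK_SIZE 4096

struct kv_key {
	unsigned char hash[16];		/* e.g. a 128-bit digest */
	unsigned int n;			/* the "$n" collision suffix */
};

extern int kv_lookup(const struct kv_key *key, unsigned long *blkno);
extern int kv_insert(const struct kv_key *key, unsigned long blkno);
extern void hash_block(const void *data, unsigned char *hash_out);
extern int read_block(unsigned long blkno, void *buf);
extern unsigned long alloc_and_write_block(const void *data);

/* Return a block number backing `data`, sharing an existing
 * block only when the raw contents genuinely match. */
static unsigned long dedup_write(const void *data)
{
	struct kv_key key;
	unsigned long blkno;
	static char buf[BLOCK_SIZE];

	hash_block(data, key.hash);

	/* probe "$hash $n" keys in order; n > 0 only ever
	 * happens after a real hash collision */
	for (key.n = 0; kv_lookup(&key, &blkno) == 0; key.n++) {
		read_block(blkno, buf);
		if (memcmp(buf, data, BLOCK_SIZE) == 0)
			return blkno;	/* true duplicate: share it */
		/* same hash, different data: try the next n */
	}

	/* no match: write the new block, add one datum to the index */
	blkno = alloc_and_write_block(data);
	kv_insert(&key, blkno);
	return blkno;
}

Note that the "$n" suffix does the work the hash lists would otherwise do: a genuine collision simply lands at the next n, so the filesystem proper needs no relinking or per-hash chains.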

Jeff


