Re: Transparent compression in the FS

From: Eli Carter
Date: Fri Oct 17 2003 - 11:24:01 EST


John Bradford wrote:
The upshot of all that would be that if you needed space, it would be
there, (just overwrite the uncompressed versions of files), but until
you do, you can access the uncompressed data quickly.

You could even take it one step further, and compress files with gzip
by default, and re-compress them with bzip2 after long periods of
inactivity.

Note that a file compressed with bzip2 is not necessarily smaller than the same file compressed with gzip. (It can be quite a bit larger in fact.)


Have you noticed that with real-life data, or only test cases?

Real-life data. I don't remember the exact details for certain, but as best as I can recall: I was dealing with copies of output from build logs, telnet sessions, messages files, or the like (i.e. text) that were (many,) many MB in size (and probably highly repetitititititive). I wound up with a loop that compressed each file into a gzip and a bzip2, compared the sizes, and killed the larger. There were a number of .gz's that won. (I have also read that gzip is better at text compression whereas bzip2 is better at binary compression. No, I don't remember the source.)

But that is immaterial... You have to deal with the case where the 'better' algorithm gives 'worse' results (by size). Keep in mind that some data won't compress at all (for a given algorithm), and winds up needing more space in the compressed form. (In which case we add a byte to say "this is not compressed" and keep the original form.)

uncompressed -> gzip; gzip -> bzip2 would be by far the normal case
But, sometimes gzip can't get it any smaller, or would increase the size. (Keep in mind we may be storing a file that is already compressed...)

So your scheme needs to note when compression fails so it doesn't try again, so we see:

uncompressed -> gzip or uncompressed(gzip failed)
gzip -> bzip2 or gzip(bzip2 failed)
uncompressed(gzip failed) -> bzip2 or uncompressed(bzip2 failed)

If it were me, I'd do it with one compression algorthim as a proof-of-concept, then add a second, and then generalize it to N cases (which would not be hard once the 2 cases was done).

But I must say, I like your idea of keeping the uncompressed form around until we need the space. (I'd also want to track reads separately from writes.)

Eli
--------------------. "If it ain't broke now,
Eli Carter \ it will be soon." -- crypto-gram
eli.carter(a)inet.com `-------------------------------------------------

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/