Re: Call to atention about using hash functions as content indexers(SCM saga)

From: Richard B. Johnson
Date: Tue Apr 12 2005 - 07:18:56 EST


On Mon, 11 Apr 2005, Petr Baudis wrote:

Dear diary, on Tue, Apr 12, 2005 at 12:40:21AM CEST, I got a letter
where Pedro Larroy <piotr@xxxxxxxxxx> told me that...
Hi

Hello,

I had a quick look at the source of GIT tonight, I'd like to warn you
about the use of hash functions as content indexers.

As probably you are aware, hash functions such as SHA-1 are surjective not
bijective (1-to-1 map), so they have collisions. Here one can argue
about the low probability of such a collision, I won't get into
subjetive valorations of what probability of collision is tolerable for
me and what is not.

I my humble opinion, choosing deliberately, as a design decision, a
method such as this one, that in some cases could corrupt data in a
silent and very hard to detect way, is not very good. One can also argue
that the probability of data getting corrupted in the disk, or whatever
could be higher than that of the collision, again this is not valid
comparison, since the fact that indexing by hash functions without
additional checking could make data corruption legal between the
reasonable working parameters of the program is very dangerous.

(i) 1461501637330902918203684832716283019655932542976 possible SHA1 hashes.

(ii) In git-pasky, there's (turnable off) detection of collisions.

(iii) Your argument against comparing with the probability of a hardware
error does not make sense to me.

(iv) You fail to propose a better solution.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

This is a standard, free (no patents) hash-function that works.
The fact that somebody can write a program, designed to create
collisions, and demonstrate that after many weeks of processing,
iteratively building upon the previous result, and finally
cause a collision, really isn't relevant for this application.

There is a good possibility that there are no hash-functions
that can't be broken.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.11 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/