Re: more on hash functions

Chuck Lever (cel@monkey.org)
Mon, 12 Apr 1999 00:30:23 -0400 (EDT)


On Sat, 10 Apr 1999, Paul F. Dietz wrote:
> Chuck Lever wrote:
>
> > agreed that multiplication should be avoided. however, in some cases
> > nothing else works.
>
> However, this is not one of those cases, if loads are
> much cheaper than multiplication. A quick
> and easy hash function would be to hash a 20 bit
> value by using its two 10 bit words as indices into
> two tables of randomly chosen values (in the
> range 0..PAGE_HASH_SIZE-1). Xor the two values together.

i hadn't considered trying a table-driven hash function because Knuth
poo-poo'd the idea very early in the section on hashing in Vol. 3 (i don't
remember why at the moment). so, i took your e-mail as a challenge to try
a simple table-driven hash implementation for the one cache that has been
stubbornly untunable: the buffer hash. (the reason for the difficulty is
the regular pattern of block offsets that ext2 generates -- it makes it very hard
to find an "easy" hash function with the desirable properties.)

i implemented something close to what you suggested here: a hash function
from 24 bits into 12 using an exclusive-or of two random table look-ups.
the reason for the increased size is that the buffer cache can contain
tens of thousands of objects, and to keep positive *and* negative lookups
quick, the hash chains need to be kept short (<15). to do this, you need
to have a large hash table -- 10 bits (1024 buckets) simply aren't enough.
in fact, 12 bits still aren't enough, but it works well enough for a
demonstration.
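
for reference, here's a rough userspace sketch of the sort of thing i
tried -- the table sizes, the names, and the use of rand() to seed the
tables are illustrative only, not the actual kernel code:

#include <stdlib.h>

#define HASH_BITS       12
#define HASH_SIZE       (1 << HASH_BITS)        /* 4096 buckets */
#define HASH_MASK       (HASH_SIZE - 1)

/* one entry per possible 12-bit index, seeded once at init time */
static unsigned short rand_tbl_lo[HASH_SIZE];
static unsigned short rand_tbl_hi[HASH_SIZE];

static void table_hash_init(void)
{
        int i;

        for (i = 0; i < HASH_SIZE; i++) {
                rand_tbl_lo[i] = rand() & HASH_MASK;
                rand_tbl_hi[i] = rand() & HASH_MASK;
        }
}

/* hash a 24-bit block number into HASH_BITS bits: xor the table
 * entries selected by the low and high 12 bits of the key */
static inline unsigned int table_hash(unsigned int block)
{
        return rand_tbl_lo[block & HASH_MASK] ^
               rand_tbl_hi[(block >> HASH_BITS) & HASH_MASK];
}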

every hash function i've tried, including this one, left almost half the
buckets empty during a moderate file system load. it was also comparable to
the other hash functions in terms of average find_buffer() iterations per
lookup request. however, the bucket-size distribution for this hash degenerated
more quickly than the distribution for the multiplicative hash.

there are a few places where this style of hash function falls down when
compared to the others.

1. random table size is huge. even for a 24-bit to 12-bit hash, the two
random tables are 2 bytes * 2^12 entries (2 pages) each. together the pair of
tables is as large as the L1 cache on some CPUs. this is not very useful
data to have chewing up CPU caches and memory bandwidth; not to mention
there are 4 useless bits in every entry (12 bits stored in an unsigned
short).

2. hash table size is inflexible. in essence, you have to maintain a
pair of random tables for each hash table size you want to use.

3. with this scheme, the hash table can't practically be grown large enough
for the number of objects we want to hash. for the buffer hash, we'd like to hash
a 32-bit block number to 14 or more bits. every doubling of hash table
size means a significant decrease in average hash chain length. as well,
since even good hash functions generally hit only about half the buckets,
we want the hash table larger still (this is also why one should take any
mathematical claims about the goodness of a hash function with a shaker of
salt).

a 14-bit random table would consume 8 pages (16 total for both). if
you want 15 bits or even 16 (the size of the buffer hash table in stock
2.2.5 kernels), you're talking about as much as 64 pages total of
memory (but at least there are no wasted bits when 16-bit random tables are
stored in an array of short ints!).
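
just to show where those numbers come from (assuming 4KB pages and 2-byte
table entries; this little program is only an illustration):

#include <stdio.h>

int main(void)
{
        int bits;

        /* two random tables, one unsigned short (2 bytes) per entry,
         * one entry per possible index value, 4KB pages */
        for (bits = 12; bits <= 16; bits++) {
                long bytes = 2L * (1L << bits);

                printf("%d bits: %ld bytes per table, %ld pages per table, "
                       "%ld pages for both\n",
                       bits, bytes, bytes / 4096, 2 * (bytes / 4096));
        }
        return 0;
}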

would you want something as useless as *random* *data* cluttering the
physical memory of a 68030-class machine? while memory accesses are
comparatively cheap on such hardware, physical RAM is usually at a
premium. thus, i would try a hash function, such as one that uses
multiplication, if it:

a. allows me to avoid the use of large random tables, which would be hit
pretty hard if they were used by all the hash functions in the kernel

b. allows me to reduce the number of buckets in my hash tables by making
better use of the buckets i have
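
by a hash function that uses multiplication i mean something along the
lines of the classic golden-ratio multiplicative hash -- a sketch only,
with the usual textbook constant rather than anything i've benchmarked:

/* multiply by a constant near 2^32/phi and keep the top "bits" bits
 * of the product; 2654435761 is a prime close to 2^32/phi */
static inline unsigned int mult_hash(unsigned int block, unsigned int bits)
{
        return (block * 2654435761U) >> (32 - bits);
}

/* e.g. bucket = mult_hash(blocknr, 14) for a 16384-bucket table */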

i'm sure this is more than anyone cares to know... but i'm obsessive.

has anyone tried using a multiplicative hash on a 68030 Linux build to see
if it impacts performance significantly? i think it's arguable whether 68K
Linux needs the scalability of large hash tables and a good multiplicative
hash, anyway. as alan pointed out, it is possible (although more
complicated) to use an architecture-specific hash function for these
hashes.

> This class of hash functions (parameterized by
> the table contents) is provably good, in the sense
> that for any two distinct keys, a randomly chosen
> hash function from this class will cause the keys
> to collide with probability 1/PAGE_HASH_SIZE
> (the class of hash functions is "universal";
> see Cormen, Leiserson, and Rivest, "Introduction
> to Algorithms", chapter 12.)
>
> Moreover, the expected behavior of hashing with
> chaining is O(n/m) time per operation for n values
> stored in a table of size m (basically, pairwise
> independence of the hash values is enough to
> ensure this.) Again, this is regardless of
> the actual keys; the randomness comes from the
> tables.

- Chuck Lever

--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project: http://www.citi.umich.edu/projects/citi-netscape/
