Re: questions about new buffer hash

David S. Miller (davem@redhat.com)
Fri, 29 Oct 1999 14:09:54 -0700


Date: Fri, 29 Oct 1999 18:32:34 +0200 (CEST)
From: Andrea Arcangeli <andrea@suse.de>

As you mention "tedious analysis", would it be possible to see it?

First I modified the kernel so that I could dump the active buffer
head state to user space at any point in time. I created a new
procfs node, /proc/bcache, which allows this. It dumped raw structures
which contained, for each currently active buffer head, only the
elements of the buffer head structure that the hash function cares
about.
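
(For reference, as far as the analysis program at the end of this mail
is concerned, each record in a /proc/bcache snapshot is just a
(dev, size, block) triple written out back to back. A minimal sketch of
reading one snapshot back, assuming that layout; the snapshot file name
here is only an example:

    /* Sketch: read a /proc/bcache snapshot back into user space.
     * The record layout mirrors struct bcent in bcdump.c below;
     * the snapshot file name is just an example. */
    #include <stdio.h>

    struct bcent { int dev, size, block; };

    int main(void)
    {
        FILE *fp = fopen("bcache-snapshot.dump", "rb");
        struct bcent e;

        if (!fp)
            return 1;
        while (fread(&e, sizeof(e), 1, fp) == 1)
            printf("dev %08x size %d block %d\n", e.dev, e.size, e.block);
        fclose(fp);
        return 0;
    }
)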

Next, using this kernel, I set up several block device
configurations to get a decent mix of block device numbers. For
the most part the configurations were composed of combinations, in
varying ratios, of SCSI disks, IDE disks, and RAID volumes built
from the two.

On these configurations I would run two sets of tests to
fill up the buffer cache. The first set was to fill the disks
up with tons of directories and other metadata, then run a
"find -type f" to fill as much of main memory with buffer cache
data as I could. The second test used "dbench X", where X was chosen
such that 2.2.x (which has neither your LRU changes nor the new
page cache stuff) came under only minimal memory pressure and main
memory held most of the dirty data and metadata written by the
benchmark.

At 1-second intervals, and also right after each test set,
buffer cache snapshots were taken through my /proc/bcache interface
and stored in files.

Next, using the test data acquired from these machines, I wrote a
simulator into which various hash functions could be plugged
and from which various statistics about the distribution of the
hash input bits could be obtained.

The simulator would simply take the hash function you selected and
the hash table size you gave it, and build a hash table from a
dump file, reproducing the buffer cache hash table state for any
one of the snapshots obtained during the test runs described above.
It would print out all sorts of useful statistics, such as the longest
hash chain length, the shortest, the average length, etc.
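
A minimal sketch of the kind of chain statistics pass meant here,
written as if dropped into bcdump.c below next to print_hash_table()
(which already builds the per-bucket counts); the helper and its name
are mine, not part of the original tool:

    /* Sketch: longest/shortest/average chain length over the per-bucket
     * counts hlen[0..nbuckets-1]; nents is the number of buffer heads.
     * Illustration only -- bcdump.c itself reports a length histogram. */
    static void chain_stats(const int *hlen, int nbuckets, int nents)
    {
        int i, longest = 0, shortest = nents, used = 0;

        for (i = 0; i < nbuckets; i++) {
            if (hlen[i] > longest)
                longest = hlen[i];
            if (hlen[i] && hlen[i] < shortest)
                shortest = hlen[i];
            if (hlen[i])
                used++;
        }
        printf("longest %d shortest %d avg(non-empty) %.2f\n",
               longest, used ? shortest : 0,
               used ? (double)nents / used : 0.0);
    }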

At this point I had all of the data and tools necessary to perform
the analysis.

The next task, of course, was hash function experimentation and trying
to quantify something about the behavior of the bits in the hash input
values.

The simulator was first given the old buffer cache hash function, plus
another function which simply took the device/blocknr and jumbled all
of the bits evenly into the hash modulo bits. The latter was
found to behave better than the existing hash, but still left a few
medium-to-long chains during the simulations.
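
(For what it's worth, the multiply-and-add mixer that appears below as
hash2ints(), in the disabled #if 0 arm of the HASH macro selection, is
presumably that second "jumble all the bits" candidate.)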

To come up with the final hash function seen in the code right now,
I analyzed how often each bit of each hash function input value
changed across all the buffer cache state dumps.

Each bit in each of the two inputs was assigned a "changing priority"
by the simulator. A bit which changed more often was assigned a
higher priority. So for example a bit which was 0 half of the time
and 1 half of the time would have the highest possible priority.

Bits which changed rarely or not at all are of no use to the hash,
since they will not change its value much, if at all. Bits which
change frequently, if positioned in non-conflicting spots within
the hash, will aid the distribution of the hash function.
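
In terms of the per-bit one/zero counts that bcdump.c below already
gathers (the block_entropy_one[]/block_entropy_zero[] arrays and their
dev_ counterparts), one way to express such a priority looks like this;
the scoring function itself is only an illustration, not something
lifted from the program:

    /* Sketch: score a bit by how close it comes to flipping half the
     * time.  ones/zeros are the per-bit counts print_hash_table()
     * accumulates; 1.0 means 50/50 (best), 0.0 means the bit never
     * changes (useless to the hash).  Illustration only. */
    static double bit_priority(int ones, int zeros)
    {
        int total = ones + zeros;
        int minority = ones < zeros ? ones : zeros;

        return total ? 2.0 * (double)minority / total : 0.0;
    }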

With that in mind, and my bit changing priorities in hand, I decided
upon the positions where the high-priority bit groups of the block and
device numbers belonged within the bits covered by the hash modulo.
This is how the hash function you see was derived.
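
Concretely, the winner is the HASHDEV/HASH macro pair in bcdump.c
below; expanded by hand for the table size used there (HSHIFT = 15) it
comes out to the following, where the helper name is mine:

    /* HASH(bcp) from bcdump.c expanded for HSHIFT = 15.  The low 15
     * bits of the result index the table; each XOR term drops a
     * different group of high-priority input bits into that window. */
    static unsigned int bhash15(unsigned int dev, unsigned int block)
    {
        unsigned int hash = (dev << 9) ^ (dev << 6)
                          ^ (block << 9) ^ (block >> 13) ^ (block << 3);

        return hash & ((1 << 15) - 1);
    }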

Finally, I took this new hash function and plugged it into the
simulator. It performed better than the other two test hash
functions, and had the best distribution for all hash table size
selections. On average the distribution was at worst 92% off from
an ideal, perfectly even distribution. No hash function is perfect,
and perfection was not what was striven for here.

I enclose below the dump hash analysis program, in the utterly raw
form I left it in when I finished my buffer cache hash function
tuning.

The hash table size is chosen by the HSHIFT macro, and the hash
function is controlled by the HASH macro. Like I said, this thing
is in a very raw form, which was fine for me since I kept it
constantly recompiled and open in an editor during my work, which
was all I needed. I could turn various forms of statistics on and
off as I needed them using preprocessor directives.
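
For completeness: it is a single-file build and takes exactly one
argument, the file a /proc/bcache snapshot was saved into, i.e.
something along the lines of "cc -O2 -o bcdump bcdump.c" followed by
"./bcdump <snapshot-file>".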

I hope this satisfies all of your curiosities Andrea.

-------------------- bcdump.c --------------------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

/* One buffer cache entry as dumped by /proc/bcache. */
struct bcent {
    int dev, size, block;
};

/* Per-device entry counter used by print_dev_stats(). */
struct devent {
    int dev, count;
};

/* Count how many dump entries belong to each device and print a summary. */
static void print_dev_stats(struct bcent *bcp, int num)
{
    struct devent *dbase = malloc(num * sizeof(struct devent));
    int i, dnum;

    dnum = 0;
    for (i = 0; i < num; i++, bcp++) {
        if (dnum == 0) {
            dbase[dnum].dev = bcp->dev;
            dbase[dnum].count = 1;
            dnum++;
        } else {
            int dev = bcp->dev;
            int x;

            for (x = 0; x < dnum; x++) {
                if (dbase[x].dev == dev) {
                    dbase[x].count++;
                    goto next;
                }
            }
            dbase[dnum].dev = dev;
            dbase[dnum].count = 1;
            dnum++;
        }
    next: ;
    }
    for (i = 0; i < dnum; i++) {
        printf("dev[%08x] count %d\n",
               dbase[i].dev, dbase[i].count);
    }
    free(dbase);
}

/* Alternate test hash: multiply-and-add mix over shifted copies of the
 * two inputs (selected by the disabled #if 0 arm of HASH below). */
static int hash2ints(int x, int y)
{
    int hash = 0;

    hash = x - 1;
    hash = ((hash * 7) + (x >> 8)) - 1;
    hash = ((hash * 7) + (x >> 16)) - 1;
    hash = ((hash * 7) + (x >> 24)) - 1;
    hash = ((hash * 7) + y) - 1;
    hash = ((hash * 7) + (y >> 8)) - 1;
    hash = ((hash * 7) + (y >> 16)) - 1;
    hash = ((hash * 7) + (y >> 24)) - 1;

    return (hash);
}

#if 0
#define HASH(bcp) (hash2ints((bcp)->dev, (bcp)->block))
#else
#define HASHDEV(__X) (((__X) << (HSHIFT - 6)) ^ ((__X) << (HSHIFT - 9)))
#if 1
#define HASH(bcp) (HASHDEV((bcp)->dev) ^ ((((bcp)->block << (HSHIFT - 6))) ^ (((bcp)->block >> 13)) ^ (((bcp)->block) << (HSHIFT - 12))))
#else
#define HASH(bcp) (HASHDEV((bcp)->dev) ^ ((((bcp)->block << 7)) ^ (((bcp)->block >> 8)) ^ (((bcp)->block))))
#endif
#endif

#define HSHIFT (15)
#define HENTS (1 << HSHIFT)

/* Hash every dump entry with HASH(), then print a chain-length histogram
 * and per-bit one/zero counts for the block and device numbers. */
static void print_hash_table(struct bcent *bcp, int num)
{
    int *hlen = malloc(HENTS * sizeof(int));
    int *htotals = malloc((num + 1) * sizeof(int));
    int hmask = (HENTS) - 1;
    int i, shortest_chain, longest_chain, prev;
    int smallest_block, largest_block;
    int block_entropy_zero[32], block_entropy_one[32];
    int dev_entropy_zero[32], dev_entropy_one[32];

    for (i = 0; i < 32; i++)
        block_entropy_zero[i] =
        block_entropy_one[i] =
        dev_entropy_zero[i] =
        dev_entropy_one[i] = 0;
    smallest_block = 0x7fffffff;
    largest_block = -1;
    for (i = 0; i < (hmask + 1); i++)
        hlen[i] = 0;
    for (i = 0; i < num; i++, bcp++) {
        int ent = (HASH(bcp) & hmask);

        hlen[ent]++;
        if (bcp->block > largest_block)
            largest_block = bcp->block;
        if (bcp->block < smallest_block)
            smallest_block = bcp->block;
        for (ent = 0; ent < 32; ent++) {
            if ((1 << ent) & bcp->block)
                block_entropy_one[ent]++;
            else
                block_entropy_zero[ent]++;
            if ((1 << ent) & bcp->dev)
                dev_entropy_one[ent]++;
            else
                dev_entropy_zero[ent]++;
        }
    }
#if 1
    /* Histogram: how many buckets have chains of each length. */
    for (i = 0; i <= num; i++)
        htotals[i] = 0;
    for (i = 0; i < (hmask + 1); i++)
        htotals[hlen[i]]++;
    prev = 0;
    for (i = 0; i <= num; i++) {
        if (htotals[i]) {
            printf("len(%d): num %8d\t[%8d]\n",
                   i, htotals[i], htotals[i] * i);
            if (i)
                prev += (htotals[i] * i);
        }
    }
    printf("Total %d (largest block %d smallest block %d)\n",
           prev, largest_block, smallest_block);
    printf("BLOCK ONE: [");
    for (i = 0; i < 32; i++)
        printf("%6d", block_entropy_one[i]);
    printf("]\nBLOCK ZERO: [");
    for (i = 0; i < 32; i++)
        printf("%6d", block_entropy_zero[i]);
    printf("]\nDEV ONE: [");
    for (i = 0; i < 32; i++)
        printf("%6d", dev_entropy_one[i]);
    printf("]\nDEV ZERO: [");
    for (i = 0; i < 32; i++)
        printf("%6d", dev_entropy_zero[i]);
    printf("]\n");
#if 0
    /* Optional: list the bucket indices that have each chain length. */
    for (i = 0; i < num; i++) {
        int j, printed = 0;

        for (j = 0; j < (hmask + 1); j++) {
            if (hlen[j] == i) {
                if (!printed) {
                    printf("len(%d): [", i);
                    printed = 1;
                }
                printf("%d ", j);
            }
        }
        if (printed)
            printf("]\n");
    }
#endif
#else
    /* Alternate report: walk the buckets and group equal chain lengths. */
    prev = -1;
    for (i = 0; i < (hmask + 1); i++) {
        if (hlen[i]) {
            if (prev == hlen[i]) {
                printf("%d ", i);
            } else {
                prev = hlen[i];
                if (i)
                    printf("]");
                printf("\nhash[%d]: len %d [",
                       i, hlen[i]);
            }
        }
    }
    printf("]\n");
#endif
    free(hlen);
    free(htotals);
}

int main(int argc, char **argp, char **envp)
{
    struct bcent *bcbase, *bcp;
    char *fname;
    int fd, n, resid, total;

    if (argc != 2)
        exit(99);

    fname = argp[1];
    fd = open(fname, O_RDONLY);
    if (fd < 0)
        exit(98);
    bcbase = malloc(128 * 1024 * sizeof(struct bcent));
    if (bcbase == NULL)
        exit(97);

    /* Slurp the whole snapshot file into the bcent array. */
    bcp = bcbase;
    resid = (128 * 1024 * sizeof(struct bcent));
    while ((n = read(fd, bcp, resid)) > 0) {
        resid -= n;
        bcp += (n / sizeof(struct bcent));
    }
    if (n < 0)
        exit(96);
    total = (((128 * 1024 * sizeof(struct bcent)) - resid) / sizeof(struct bcent));
    printf("Read %d buffer cache entries\n", total);

    print_dev_stats(bcbase, total);
    print_hash_table(bcbase, total);

    free(bcbase);
    close(fd);
    exit(0);
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/