Improved hash function

From: Khovayko, Oleg (NIH/NLM/NCBI) <olegh@dont-contact.us>
Date: Fri, 22 Oct 2004 13:04:05 -0400

Dear SQUID developers!

I not sure, do I want to join your team or not.

Hovewer, I make my attention to your product, and decided to test it in the
my "torture room".
After source analysing, I've found weakness of hash functions, whose are you
use in the project.

Because of mostly of your internal searches based on that functions, hash
engine (by my opinion) -
very critical part of your product.

Now I'll explain weakness of hash function, which are you use (file:
lib/hash.c):

1.
hash_string(const void *data, unsigned int size);
This isn't serious hash function - its hash value depends only of string
charset, and independent of chars order in the string.
Hence, that function can returns same hash value for "abcd","dcba","acbd",
etc.

2.
hash4(const void *data, unsigned int size);
This is much better hash function, and has good randomization.
However, it uses linear hashing algorithm (h = h * 33 + c), and can be
catched by set of pairs of characters, like
"0z", "1Y", "28", which generates same hash value in any combinations, for
example:
"280z1Y", "0z1Y28"
Also, this function calls STRLEN, and by that way, twice scans input string.
By this way, wining of "switch unlooping" is going to below the zero.

----------------

I want to offer you my own hash function, which has:
 - good randomization (like system rand())
 - can't be catched by primitive ways, like [1,2]
 - works 20% faster, than hash4 (compared with gcc -O4)
 - has compact and short code (8 lines, vs 50 lines in hash4 )

Following - code of my hash function:

---------------

unsigned int oleg_h(const char *buf, unsigned int size) {
  const unsigned int *tab = (const unsigned int*)oleg_h;
  unsigned register rc = 0x5f5f5f5f;
  unsigned register char c;
  while(c = *buf++)
     rc = ((rc << 7) | (rc >> (32 - 7))) + (tab[(unsigned char)rc + c] ^ c);

  return rc % size;
}

---------------

Limitation:
My function uses its own binary code (and some bytes follow) as
randomization table.
Hence - hash values for same strings can be changed after program rebuild.

---------------

If you have questions or comments - welcome.

Oleg Khovayko
Received on Fri Oct 22 2004 - 17:01:03 MDT

This archive was generated by hypermail pre-2.1.9 : Sun Oct 31 2004 - 12:00:02 MST