John b. matthe2009-07-12 14:12:02

I think all of the primitive wrappers (except Double) have trivially

perfect integer hashes, as they wrap distinct values. Patricia was

perhaps referring to algorithms of the kind discussed here:

The jumble algorithm relies on a hashed map for efficiency, but a

perfect hash is not essential. I recently implemented the algorithm

in both Java and Ada using library routines:

I was intrigued to compare the two languages’ library implementations

of a general purpose string hash. Using a multiplicative function with

an initial value of zero, both offer excellent performance. Java uses a

multiplier of 31 on a signed 32-bit integer; Ada uses a multiplier of

eight on an unsigned (modular) value whose size is implementation

defined (e.g. mod 2**32):

Followup-To header not set.]

—

John B. Matthews

trashgod at gmail dot com

Patricia shana2009-07-12 14:12:03

Yes, that’s the sort of thing I meant.

The Long hash code is also not perfect, and cannot be, because there are

more longs than ints.

Patricia

Mark space2009-07-12 14:12:04

I remember in school we implemented a minimal perfect hash function on a

given set of keys: the keywords in the Pascal language. We were given

the parameters of the algorithm: use the length of the keyword, the

first character and the last character only.

The concept was obviously to show how a fast look-up could be done when

implementing a compiler, but the hash function itself was interesting at

the time.

Lew2009-07-12 14:12:08

Because it has constant-time performance for get() with the String hash code.

The Java String hash is just about verbatim from Knuth. It has the virtue of

being a well-studied algorithm that has been shown to have good distribution

characteristics.

I’m not so sure about the Ada one. I haven’t studied this in a while, but I

seem to recall that power-of-two multipliers mod 2**32 don’t have as good a

distribution. Perhaps someone could correct me on that.

—

Lew

Giovanni azua2009-07-12 14:12:09

”Giovanni Azua”

“John B. Matthews”

“Patricia Shanahan”

To achieve the perfect, optimal and ideal no collisions Hash table “perfect

Hash table” these are the two factors that play a role:

1-. the “perfect” property of the hash function as defined in the wiki page

cited above

2-. the size of the elements set

If #2 goes arbitrarily large meaning allocating a hash table of the size of

that elements set becomes prohibitive memory-wise then it does not really

matter how perfect the hash function is, there will be collisions anyway and

an additional search algorithm will be needed to disambiguate the colliding

elements that fall under the same buckets. This is what I think John

Matthews hasn’t realized yet about his implementation of the Jumble

algorithm. For a realistic dataset he will be returning elements that do not

match the of input (String built on top of the sorted character input) but

matches that happen to collide under the same bucket.

e.g. trying to allocate a Hash table of the size ( Integer.MAX_VALUE –

Integer.MIN_VALUE) would be prohibitive unless you place unrealistic

requirements so you have the Integer with a perfect hash function but you

will not get a “perfect Hash table”. It is of course worse for the String

class where neither #1 nor #2 are fulfilled.

Character class offers the perfect hash function but since the elements set

is discrete i.e. 26 elements http://en.wikipedia.org/wiki/English_alphabet

you can build a perfect Hash structure like the one I described and edited

by Tom McGlynn.

Best regards,

Giovanni

Giovanni azua2009-07-12 14:12:10

I am sorry, this is wrong. Matthews’s Jumble implementation is correct and

does not have the problem I described here.

Best regards,

Giovanni

Tom anderson2009-07-12 14:12:11

John, i think you’ve missed a trick there – doesn’t Rotate_Left do a

barrel shift, where the bits that fall off the left-hand end get shoved

back in at the right, rather than a plain shift? If so, the basic

algebraic structure is actually a bit different to java’s hash.

If Rotate_Left is a plain shift, then the Ada hash is based only on the

last 10 characters of the string. Multiplying by eight is like shifting

left three places; modding by 2*32 is like anding with 0xffffffff (if

you’ll pardon my verbs). When you add in a character, the rightmost bit

you can affect is bit 0, the LSB; after ten shifts of three bits, that is

now at bit thirty, and after one more, it’s at bit thirty-three, and has

vanished.

If Rotate_Left is a barrel shift, then our bit of interest doesn’t vanish,

because it comes back to the right-hand end of the hash after the eleventh

shift in position 1. Because 3 is coprime with 32, it will then proceed to

make a complete circuit of all positions in the hash over the following

twenty-one shifts, winding up back in position 0 after the thirty-second.

I have a vague worry that this means that strings built from the same

32-character blocks arranged in any order will have the same hash, but i

suspect that this is not the case, because of carries. My maths isn’t up

to proving it, though.

I still think this is a shoddy hash. It has no diffusion: if you hash two

strings which differ at one character position, the difference between the

hashes will always be confined to the sixteen bits which correspond to

where that character ended up in the rotating hash. Java’s

multiply-and-mod approach quickly spreads the bits of each character out

over the whole hash.

tom

—

Science of a sufficiently advanced form is indistinguishable from magic

John b. matthe2009-07-12 14:12:19

D’oh, you’re absolutely right. I overlooked this when I was tinkering

with my own hash function, too. Thank you for looking at this. […]

Well, it’s a legacy from GNAT.HTable.Hash, but it does very well in the

context of Ada.Containers.Hashed_Maps. Java uses a bucket list whose

size is an integral power of two, while (this implementation of) Ada

uses the next largest prime. Modeling an Ada collision counter on your

example

results for collision count, number of buckets having that count, and

the percentage of the total:

$ ./collisions

0: 218586 (55.6%)

1: 126250 (32.1%)

2: 38432 (9.8%)

3: 8362 (2.1%)

4: 1354 (0.3%)

5: 229 (0.1%)

6: 24 (0.0%)

7: 4 (0.0%)

8: 1 (0.0%)

The code is here:

John B. Matthews

trashgod at gmail dot com

## Leave a Reply

You must be logged in to post a comment.