Mombu the Programming Forum sponsored links

Go Back   Mombu the Programming Forum > Programming > find words that contains some specific letters
User Name
Password
REGISTER NOW! Mark Forums Read

sponsored links


Reply
 
1 12th July 14:12
john b. matthews
External User
 
Posts: 1
Default find words that contains some specific letters


I think all of the primitive wrappers (except Double) have trivially
perfect integer hashes, as they wrap distinct values. Patricia was
perhaps referring to algorithms of the kind discussed here:
<http://en.wikipedia.org/wiki/Perfect_hash_function>

The jumble algorithm relies on a hashed map for efficiency, but a
perfect hash is not essential. I recently implemented the algorithm
in both Java and Ada using library routines:
<http://en.wikipedia.org/wiki/Jumble>
<http://sites.google.com/site/drjohnbmatthews/jumble>
<http://home.roadrunner.com/~jbmatthews/jumble.html>

I was intrigued to compare the two languages' library implementations
of a general purpose string hash. Using a multiplicative function with
an initial value of zero, both offer excellent performance. Java uses a
multiplier of 31 on a signed 32-bit integer; Ada uses a multiplier of
eight on an unsigned (modular) value whose size is implementation
defined (e.g. mod 2**32):
<http://www.docjar.org/html/api/java/lang/String.java.html>
<http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-strhas.adb>

[Cross-posted to comp.lang.java.programmer and comp.lang.ada;
Followup-To header not set.]

--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
  Reply With Quote


  sponsored links


2 12th July 14:12
patricia shanahan
External User
 
Posts: 1
Default find words that contains some specific letters


Yes, that's the sort of thing I meant.

The Long hash code is also not perfect, and cannot be, because there are
more longs than ints.

Patricia
  Reply With Quote
3 12th July 14:12
mark space
External User
 
Posts: 1
Default find words that contains some specific letters


I remember in school we implemented a minimal perfect hash function on a
given set of keys: the keywords in the Pascal language. We were given
the parameters of the algorithm: use the length of the keyword, the
first character and the last character only.

The concept was obviously to show how a fast look-up could be done when
implementing a compiler, but the hash function itself was interesting at
the time.
  Reply With Quote
4 12th July 14:12
lew
External User
 
Posts: 1
Default find words that contains some specific letters


Because it has constant-time performance for get() with the String hash code.

The Java String hash is just about verbatim from Knuth. It has the virtue of
being a well-studied algorithm that has been shown to have good distribution
characteristics.

I'm not so sure about the Ada one. I haven't studied this in a while, but I
seem to recall that power-of-two multipliers mod 2**32 don't have as good a
distribution. Perhaps someone could correct me on that.

--
Lew
  Reply With Quote
5 12th July 14:12
giovanni azua
External User
 
Posts: 1
Default find words that contains some specific letters


"Giovanni Azua" <bravegag@hotmail.com> wrote:

"John B. Matthews" <nospam@nospam.invalid> wrote:

"Patricia Shanahan" <pats@acm.org> wrote:


To achieve the perfect, optimal and ideal no collisions Hash table "perfect
Hash table" these are the two factors that play a role:

1-. the "perfect" property of the hash function as defined in the wiki page
cited above
2-. the size of the elements set

If #2 goes arbitrarily large meaning allocating a hash table of the size of
that elements set becomes prohibitive memory-wise then it does not really
matter how perfect the hash function is, there will be collisions anyway and
an additional search algorithm will be needed to disambiguate the colliding
elements that fall under the same buckets. This is what I think John
Matthews hasn't realized yet about his implementation of the Jumble
algorithm. For a realistic dataset he will be returning elements that do not
match the of input (String built on top of the sorted character input) but
matches that happen to collide under the same bucket.

e.g. trying to allocate a Hash table of the size ( Integer.MAX_VALUE -
Integer.MIN_VALUE) would be prohibitive unless you place unrealistic
requirements so you have the Integer with a perfect hash function but you
will not get a "perfect Hash table". It is of course worse for the String
class where neither #1 nor #2 are fulfilled.

Character class offers the perfect hash function but since the elements set
is discrete i.e. 26 elements http://en.wikipedia.org/wiki/English_alphabet
you can build a perfect Hash structure like the one I described and edited
by Tom McGlynn.

Best regards,
Giovanni
  Reply With Quote
6 12th July 14:12
giovanni azua
External User
 
Posts: 1
Default find words that contains some specific letters


I am sorry, this is wrong. Matthews's Jumble implementation is correct and
does not have the problem I described here.

Best regards,
Giovanni
  Reply With Quote
7 12th July 14:12
tom anderson
External User
 
Posts: 1
Default find words that contains some specific letters


John, i think you've missed a trick there - doesn't Rotate_Left do a
barrel shift, where the bits that fall off the left-hand end get shoved
back in at the right, rather than a plain shift? If so, the basic
algebraic structure is actually a bit different to java's hash.


If Rotate_Left is a plain shift, then the Ada hash is based only on the
last 10 characters of the string. Multiplying by eight is like shifting
left three places; modding by 2*32 is like anding with 0xffffffff (if
you'll pardon my verbs). When you add in a character, the rightmost bit
you can affect is bit 0, the LSB; after ten shifts of three bits, that is
now at bit thirty, and after one more, it's at bit thirty-three, and has
vanished.

If Rotate_Left is a barrel shift, then our bit of interest doesn't vanish,
because it comes back to the right-hand end of the hash after the eleventh
shift in position 1. Because 3 is coprime with 32, it will then proceed to
make a complete circuit of all positions in the hash over the following
twenty-one shifts, winding up back in position 0 after the thirty-second.
I have a vague worry that this means that strings built from the same
32-character blocks arranged in any order will have the same hash, but i
suspect that this is not the case, because of carries. My maths isn't up
to proving it, though.

I still think this is a shoddy hash. It has no diffusion: if you hash two
strings which differ at one character position, the difference between the
hashes will always be confined to the six**** bits which correspond to
where that character ended up in the rotating hash. Java's
multiply-and-mod approach quickly spreads the bits of each character out
over the whole hash.

tom

--
Science of a sufficiently advanced form is indistinguishable from magic
  Reply With Quote
8 12th July 14:12
john b. matthews
External User
 
Posts: 1
Default find words that contains some specific letters


D'oh, you're absolutely right. I overlooked this when I was tinkering
with my own hash function, too. Thank you for looking at this. [...]


Well, it's a legacy from GNAT.HTable.Hash, but it does very well in the
context of Ada.Containers.Hashed_Maps. Java uses a bucket list whose
size is an integral power of two, while (this implementation of) Ada
uses the next largest prime. Modeling an Ada collision counter on your
example <http://urchin.earth.li/~twic/tmp/Collisions.java>, I got these
results for collision count, number of buckets having that count, and
the percentage of the total:

$ ./collisions
0: 218586 (55.6%)
1: 126250 (32.1%)
2: 38432 (9.8%)
3: 8362 (2.1%)
4: 1354 (0.3%)
5: 229 (0.1%)
6: 24 (0.0%)
7: 4 (0.0%)
8: 1 (0.0%)
The code is here: <http://home.roadrunner.com/~jbmatthews/jumble.html>

[Followup-To set to comp.lang.ada.]
--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
  Reply With Quote
Reply


Thread Tools
Display Modes




Copyright 2006 SmartyDevil.com - Dies Mies Jeschet Boenedoesef Douvema Enitemaus -
666