The output behaviour:
Bit vector size invalid word accepted?
case a 500KB No.
case b 375KB No.
case c 250KB No.
case d 125KB Yes.
Final case: 93.75 KB Yes.
The first numbers on the bit vector memory feel not so good for a 58K word list. The mathematics that, describes the relation between the hash bit length (32 here), word list length and the probablity of an invalid word hashing correctly is described by McIlroy. This comes into play in the last two cases above. I did not prove it but, runs of the program point to that clearly. It can still be downsized but, the point was to try the scatter technique for the spell checker.
Observation: I did change the 32 values used in the hash functions which gave me a good reduction in bit vector size and lower error rate. The
final run used 32 consecutive values from the farther end of the proposed hash table size. Previously, I was using differences of 100 ie N - 100, N - 200, N -300, N -400 ...... N - (32 * 100). All that left me was with even values in the hash functions. This affected the accuracy of the hash I can say from the run of the program (i.e invalid words matched seeminlgy easy). But, when I used the consecutive values i.e N - 1, N -2....N - 32 I have introduced primes not purposefully. But, the urge to change the hash values was there. McIlroy says that, the use of n primes for the n hash functions is very good and he used it. A lot of reading on the web claimed that, the use of primes in hash function and their advantage was not that great. But, this small program with only a couple of primes seems to be faring better than with out them. Why 32? McIlroy used 27 I picked 32!. :)). This choice is to be based on the mathematical relation I pointed to above (refer McIlRoy paper).
The program when profiled in Netbeans takes 0.169ms to find if a word in the dictionary. A run of the program on this 573 word post gave optimized, Netbeans, McIlroy, atleast, probablity, seeminlgy as invalid. True. None of these were in the word list. "seeminlgy" is misspelt in this post! The checking code took 20.4 ms.