A pure javascript client-side trie-based dictionary, which is compact and can do fast look-ups.
The library is pretty-much a direct copy of code from Steve Hanov.
The changes I've made are (a) set up for RequireJS
, and (b) added the
SimpleSelectIndex
class to speed up word lookup. This was needed for a
a Boggle Solver I made.
You can try the code out by downloading it and loading
index.html
into a browser. A demo is available here
The original motivation was to solve Boggle client-side, with the following requirements:
- the amount of data uploaded should be small, for a fast start-up;
- the time for code to initialize to be fast, again to enable a fast start-up;
- and the run-time performance should be "fast enough" so that users aren't aware of waiting around for results.
These are non-trivial requirements to implement in a webapp such as Boggle where a large dictionary is needed and a significant amount of computation takes place to "solve" a setup. The upside is that, if these requirements can be met, a large number of users can be supported at low cost as all the computation is done client-side. The other advantage of a client-side solution is that it can work offline, since nothing is required from the server in routine operation.
There are, of course, other applications of a large but compact dictionary, with fast lookup, stored client-side.
John Resig wrote a series of blog posts (here, here and here) a couple of years ago, studying a similar problem -- looking up words from a dictionary.
An interesting solution Resig looked at was supplied by Steve Hanov who used a succinct data structure to implement a Trie.
Succinct data structures were introduced by Jacobson (G. J. Jacobson 1988) in his PhD thesis (a summary paper can be found in (G. Jacobson 1989)). They have interesting properties, being both space and time efficient. Succinct trees are asymptotically space-optimal yet can be used directly, without decompression or other pre-processing, and perform within a small constant time factor of pointer-based data structures.
This is very useful for our problem: the data structures can be sent from the server without taking up too much space, and the client does not need to decode the structures in order to process them. This is important for JavaScript as Resig found that decompressing a trie can take a long time on mobile devices.
Hanov's solution, for which he provides JavaScript code, implements Jacobson's ideas with a twist. Bitstrings are stored in Base64, which means the data doesn't need to be decoded client-side, at the expense of storing 6 bits of information in an 8-bit byte.
The only problem with Hanov's solution is that it is slow. I
downloaded a word list containing about 270,000
words from
this file here and found that using Hanov's trie Boggle took
about 20 seconds to solve on the HTC Desire, which is not acceptable.
For John Resig's problem, looking up a single word, Hanov's solution is probably OK as only a few nodes of the Trie have to be searched. The Boggle solver has to look at 3-5000 nodes of the trie as we're doing an exhaustive search of the Boggle board and this is too slow by at least an order of magnitude if we are to meet our run-time performance requirements.
The bottleneck occurs with the algorithm for the select
function. select(i)
finds the location in a bit string of the i
th
occurrence of the zero bit. Hanov's implementation is based on the
rank
function where rank(i)
is the number of zero bits set up to
and including position i
. These two functions are inverses, so that
if rank(j) = k
then select(k) = j
. Hanov's algorithm does a
binary search over the rank
function to determine select, and this
is slow.
I have implemented a simple alternative select
function which stores
an array S
where the i
th element contains the value for
select(i * N)
where N
is set at 32
, but can be any number. I've
tried 64
which uses half the space and takes about 30% longer. To
find select(k)
we find S[k/N]
and can then count the remaining
bits. In effect each select is then looking for a value less than
N
. Performance depends on the distribution of the zeros and ones
being reasonably uniform (which it is in this case). The speedup is
significant -- in my tests a factor of about 14
. This makes the
wait for a solution about a second on a slow phone, which is
acceptable.
Having implemented this solution it looks very similar to that
proposed, implemented and tested in (Navarro and Providel 2012)
for select
queries.
So, we end up with a data structure, which is reasonably compact. The
gzipped trie (ie the size sent over the wire) is 368,640
bytes,
while the gzipped dictionary is 470,298
bytes. The uncompressed
dictionary is much larger than the uncompressed trie of course. So using the
succinct structure saves space compared to the raw dictionary as well
as being optimised for word search, and we can do everything in
JavaScript with reasonable performance.
Setting up our web site/app to work offline is straightforward if we use the Offline cache feature of HTML5 (well described here).
Jacobson, Guy Joseph. 1988. “Succinct static data structures.” Pittsburgh, PA, USA.
Jacobson, Guy. 1989. “Space-efficient static trees and graphs.” In Foundations of Computer Science, 1989., 30th Annual Symposium on, 549–554. IEEE.
Navarro, Gonzalo, and Eliana Providel. 2012. “Fast, small, simple rank/select on bitmaps.” In Experimental Algorithms, 295–306. Springer.