Implementations of the classic and covering locality-sensitive hashing schemes for Hamming space
Hemingway is a library for performing nearest-neighbour searches in datasets consisting of high-dimensional bit vectors using locality-sensitive hashing. Two flavors of LSH are currently supported: Classic LSH, originally described by Piotr Indyk and Rajeev Motwani, and Covering LSH, a newer construction by Rasmus Pagh. You can read our paper on the library here.
As for the name? Say the words "locality-sensitive hashing" and "Hamming space" fast enough a bunch of times and you'll eventually arrive at "Hemingway". If not, you probably didn't pick appropriate parameters.
Hemingway can be built using any compiler that supports C++11, but ships with a CMake setup for ease of use. To get started, make sure that CMake is installed and then do:
cmake . && make
This will compile the implementation files to src/libhemingway.a
. Grab this along with the header files and you'll be all set! You can check out the next section for a guide on how to actually use the library.
The design of Hemingway is fairly simple in that it exposes only two classes for all your LSH needs: lsh::vector
and lsh::table
. The lsh::vector
class is used for representing bit vectors of arbitrary dimensionality and is constructed by supplying a std::vector<bool>
containing the components of the vector:
lsh::vector v({1, 0, 0, 0, 1, 1, 0, 1});
The lsh::table
class is used for representing a lookup table containing partitions of vector buckets. Two LSH schemes are currently supported in lookup tables:
Classic: In this scheme, vectors are hashed into buckets using random bit masks associated with each partition. When constructing this table, 3 parameters are specified: The dimensionality of input vectors, the number of bits to sample from vectors, and the number of partitions to use:
lsh::table t({.dimensions = 8, .samples = 3, .partitions = 4);
Covering: In this scheme, vectors are hashed into buckets using carefully constructed bit masks that ensure that the hashes of vectors within a given radius from each other will collide. Only two parameters are specified when constructing this table: The dimensionality of input vectors, and the radius that should be covered:
lsh::table t({.dimensions = 8, .radius = 2);
Once you've constructed your table, go ahead and add your vectors:
t.insert(v);
Once you've added all your vectors you can perform queries against the lookup table:
lsh::vector r = t.query(lsh::vector({1, 0, 1, 0, 1, 1, 0, 0}));
This library came about as a result of the Advanced Algorithms seminar held at the IT University of Copenhagen. We would like to give thanks to our supervisors for not only their help but also their immense patience during the seminar.
Kasper Kronborg Isager | Radosław Niemczyk |
Copyright © 2016 Kasper Kronborg Isager and Radosław Niemczyk. Licensed under the terms of the MIT license.