Find similar sets of things using LSH (locality-sensitive hashing). This algorithm is currently in production, blocking over six billion sets.
This project provides a library that can be used on its own; it also contains Spark and Flink jobs.
./gradlew
This builds all jars including uber jars for the projects that need them.
Download the top 333k search terms, or create a file of the random words you want to use, in the following format:
the 23135851162
of 13151942776
and 12997637966
to 12136980858
a 9081174698
in 8469404971
for 5933321709
is 4705743816
on 3750423199
that 3400031103
I used the following list: https://github.com/first20hours/google-10000-english
java -jar lsh/build/libs/lsh-1.0.jar gen-sentences data/30K_top_words.txt 50 100000 5 10 data/sentences.txt
where:
- data/30K_top_words.txt is the word list used to seed the sentences
- 50 is the number of top words to use
- 100000 is the number of sentences to generate
- 5 is the minimum number of words in a sentence
- 10 is the maximum number of words in a sentence
- data/sentences.txt is the file you want to write the sentences to
It's important to understand that the number of sentences affects how much truth you get: the next step, generating truth, essentially compares each sentence against every other sentence. 100,000 sentences means almost 5 billion comparisons (100,000 * 99,999 / 2 = 4,999,950,000), but only the sentence pairs that pass the threshold (0.4 similarity) will be saved to the truth file.
The data looks like this:
0,my all free but to more you that new if to of was new one or on for
1,are with free have an if you one
2,other be is this an for be we i not you a home on us this page
3,not can was are have it home information of not from be at but was page
where the first column is the sentence key and the second column is the sentence itself. I used comma-delimited output to make it easier to load this data into Kafka so I can do LSH with Kafka Streams.
Generating a truth set requires comparing each sentence against every other sentence. This is a brute-force process, so there are two approaches available:
- A Spark job, which you can run locally to utilize all of your cores.
- A single-threaded Java class.
spark-submit --master local[*] spark/build/libs/spark-0.1.0.jar genTruth data/sentences.txt data/truth 0.7
NOTE: The Spark job will write multiple files to the data/truth directory.
java -jar lsh/build/libs/lsh-1.0.jar gen-truth data/sentences.txt data/truth.txt 0.7
where:
- data/sentences.txt is the source to use for the truth set
- data/truth or data/truth.txt is where the truth will be written; the single-threaded Java class writes a single file
- 0.7 is the minimum Jaccard similarity score a sentence pair must have to make it into the truth set
When doing this locally, my truth file resulted in about 17,000 pairs.
The data looks like:
0 336 0.4 10/23
0 682 0.4 8/20
0 1255 0.4 8/20
0 2577 0.5 10/20
0 3152 0.4 9/21
where:
- 1st column is a doc id (the smaller id always comes first)
- 2nd column is the other doc id
- 3rd column is the Jaccard score
- 4th column is the number of common set fields over the total number of unique set fields
This command is a brute force compare. It takes a while.
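The Jaccard score in the 3rd column (and the common/unique counts in the 4th) come from the two sentences' word sets. Here is a minimal sketch of that per-pair computation, illustrative only and not this project's actual code:

```java
import java.util.*;

// Minimal sketch of the per-pair work in truth generation (illustrative,
// not this project's code): Jaccard = |A ∩ B| / |A ∪ B| over word sets.
public class JaccardSketch {
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);                 // common words
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                           // unique words across both
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> s1 = new HashSet<>(Arrays.asList("are", "with", "free", "have", "an", "if", "you", "one"));
        Set<String> s2 = new HashSet<>(Arrays.asList("not", "free", "have", "it", "you", "one"));
        // 4 common words / 10 unique words = 0.4; only pairs at or above
        // the threshold are written to the truth file.
        System.out.println(jaccard(s1, s2));
    }
}
```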
Install Spark 2.0 either locally or on a cluster. Installing it locally means you just unzip/untar it in a directory and add SPARK_INSTALL/bin to your PATH.
spark-submit --master local[*] spark/build/libs/spark-1.0-all.jar data/sentences.txt data/truth.txt 40 5 true false 256
where:
- local[*] runs one worker per core in local Spark mode; set it to a specific number to use that many cores
- data/sentences.txt is the file that contains the sentences you want to generate LSH keys for
- data/truth.txt is the truth file
- 40 is the number of hash functions to use for Minhash
- 5 is the number of rows (minhash values) you want per hash key
- the first true/false sets whether to use a new shifting-key feature idea I came up with; if you know what an n-gram is, think of n-gramming the minhash key
- the second true/false sets whether to compress the LSH key down to a 23-character string; the more rows per band, the more savings this gives you, but if you want smaller bands this might actually increase the size of your key
- 256 is the max number of ids that can fit in a block; after that, the block is marked as oversized by clearing out all ids and replacing them with a -1
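To make the 40-hash / 5-rows-per-band parameters concrete, here is a minimal minhash-plus-banding sketch (illustrative only, with a toy hash family; not this project's implementation). With 40 hash functions and 5 rows per band, each sentence yields 8 LSH keys, and two sentences land in the same block whenever any key matches:

```java
import java.util.*;

// Toy MinHash + banding sketch (not this project's implementation).
public class MinhashSketch {
    public static void main(String[] args) {
        int numHashes = 40;   // total minhash values per sentence
        int rowsPerBand = 5;  // minhash values concatenated into one LSH key
        Set<String> words = new HashSet<>(
                Arrays.asList("are", "with", "free", "have", "an", "if", "you", "one"));

        // Cheap hash family: XOR each word's hash with a per-function seed.
        Random rnd = new Random(42);
        int[] seeds = new int[numHashes];
        for (int i = 0; i < numHashes; i++) seeds[i] = rnd.nextInt();

        // The signature is the minimum hash per function over all words.
        int[] signature = new int[numHashes];
        Arrays.fill(signature, Integer.MAX_VALUE);
        for (String w : words) {
            for (int i = 0; i < numHashes; i++) {
                signature[i] = Math.min(signature[i], w.hashCode() ^ seeds[i]);
            }
        }

        // Banding: every rowsPerBand consecutive values form one LSH key.
        for (int band = 0; band * rowsPerBand < numHashes; band++) {
            StringBuilder key = new StringBuilder(band + ":");
            for (int r = 0; r < rowsPerBand; r++) {
                key.append(signature[band * rowsPerBand + r]).append('|');
            }
            System.out.println("LSH key: " + key);
        }
    }
}
```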
Testing things out via Flink, which seems to be considerably faster than Spark (CURRENTLY NOT WORKING)
Install Flink 1.1.1 either locally or on a cluster. Installing it locally means you just unzip/untar it in a directory and add FLINK_HOME/bin to your PATH.
flink run -q flink/build/libs/flink-1.0-all.jar -sentences file:///ABSOLUTE_PATH_TO/sentences.txt -truth file:///ABSOLUTE_PATH_TO/truth.txt -ck true -out file:///ABSOLUTE_PATH_TO/blocks -r 4 -mbs 256 -runs 12 -mrb 10
where:
- sentences is the file that contains the sentences you want to generate LSH keys for
- truth is the truth file
- r is the number of rows (minhash values) you want per hash key
- ck sets whether to compress the LSH key down to a 23-character string; the more rows per band, the more savings this gives you, but if you want smaller bands this might actually increase the size of your key
- mbs is the max block size, i.e. the max number of ids that can fit in a block; after that, the block is marked as oversized by clearing out all ids and replacing them with a -1
- runs is the number of runs you want to do for each row-band; for example, if you set r to 3 and runs to 4, it will run jobs with 3 hashes/3 row-bands, 6 hashes/3 row-bands, 9 hashes/3 row-bands, and 12 hashes/3 row-bands
- mrb is the max rows per band, essentially the number of rows-per-band at which the job stops
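To double-check the runs semantics, this tiny loop reproduces the example above (my reading of the description, assumed semantics rather than the actual Flink job's code):

```java
// Assumed semantics of -r/-runs, inferred from the example above:
// each run adds r more hash functions while keeping r rows per band.
public class RunsSketch {
    public static void main(String[] args) {
        int r = 3, runs = 4;
        for (int run = 1; run <= runs; run++) {
            System.out.printf("job %d: %d hashes, %d rows per band%n", run, run * r, r);
        }
        // Prints jobs with 3, 6, 9, and 12 hashes, each with 3 rows per band.
    }
}
```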
The output looks like:
0.2 0.06 225107/4073981
0.3 0.15 195752/1347864
0.4 0.36 73983/203744
0.5 0.68 11228/16403
0.6 0.91 1282/1404
0.7 0.99 154/155
0.8 1.0 25/25
0.9 1.0 2/2
Total Precision: 0.6371728189077999
507533 / (507533 + 289006)
Total Recall: 0.08993106855260971
507533/5643578
where:
- 1st column is the Jaccard score rounded to the nearest tenth
- 2nd column is the recall score, i.e. the number of pairs found by being in the same LSH blocks over the total number of truth pairs
- 3rd column is simply the raw numbers behind the recall score
Total Precision is defined as the total pairs found from LSH blocks over the sum of the pairs found and the block pairs that matched no truth. Total Recall is the total pairs found from LSH over the total truth pairs.
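Spelled out with the numbers above (a quick arithmetic check, not the project's code):

```java
// Quick check of the precision/recall totals reported above.
public class TotalsCheck {
    public static void main(String[] args) {
        long foundPairs = 507_533;       // truth pairs found via LSH blocks
        long nonTruthPairs = 289_006;    // block pairs that matched no truth
        long totalTruthPairs = 5_643_578;
        double precision = (double) foundPairs / (foundPairs + nonTruthPairs); // ~0.6372
        double recall = (double) foundPairs / totalTruthPairs;                 // ~0.0899
        System.out.println("precision=" + precision + " recall=" + recall);
    }
}
```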