This repository contains the code to generate the word2vec model for Prometheus.
Check out the word2vec submodule (the reference implementation) and then build it.
git submodule update --init
cd word2vec
make
We use the following plaintext corpora:
- sv - Vocab size: 1163288, Words in train file: 284410463
- en - Vocab size: 4891175, Words in train file: 2989787812
To allow for an unknown-word vector, first create the vocabulary, manually append an unknown-word entry to it, and then train the model.
./word2vec -train <input.txt> -save-vocab vocab.txt
echo "__UNKNOWN__ 0" >> vocab.txt
./word2vec -train <input.txt> -binary 1 -output <model.bin> -size 300 -window 5 -sample 1e-4 -negative 5 -hs 0 -cbow 1 -iter 3 -read-vocab vocab.txt -threads 4
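If the vocabulary is regenerated often, the manual append step can be wrapped so that it is safe to run more than once. The following is a minimal sketch; `append_unknown` is a hypothetical helper name and is not part of this repository. It assumes the vocabulary file ends with a newline, as written by `-save-vocab`.

```shell
# Hypothetical helper (not part of this repository):
# append the unknown-word entry to a vocabulary file unless it is already present.
append_unknown() {
    vocab_file="$1"
    # -q: no output, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "__UNKNOWN__ 0" "$vocab_file"; then
        echo "__UNKNOWN__ 0" >> "$vocab_file"
    fi
}
```

Call it as `append_unknown vocab.txt` between the `-save-vocab` run and the training run.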
Thanks to Marcus Klang, there is a way to create an extremely fast binary model. This model is read in Java using memory mapping, at near I/O speed.
It is created from the text model file. To produce the text model, run the training command above with -binary 0 instead of -binary 1.
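Before converting, it can be worth checking that the text model is complete. The reference word2vec implementation writes a header line with the vocabulary size and the vector dimension, followed by one line per word containing the word and its vector components. The sketch below checks a file against that layout; `check_text_model` is a hypothetical helper name, not something this repository provides, and a full scan of a large model may take a while.

```shell
# Hypothetical helper (not part of this repository):
# verify that a word2vec text model matches its own header line.
check_text_model() {
    awk '
        # Header line: "<vocab size> <vector dimension>"
        NR == 1 { rows = $1; dim = $2; next }
        # Every other line: the word followed by dim numbers
        NF != dim + 1 {
            printf "line %d: expected %d fields, got %d\n", NR, dim + 1, NF
            bad = 1
        }
        END {
            if (NR - 1 != rows) {
                printf "expected %d vectors, found %d\n", rows, NR - 1
                bad = 1
            }
            exit bad + 0
        }
    ' "$1"
}
```

`check_text_model model.txt` exits with status 0 when the file is consistent and prints the mismatches otherwise.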
cd vectortool
mvn package
cd target
java -jar vectortool-1.0-SNAPSHOT.jar convert ../../model.txt model.opt
Once the model is created it can be accessed using:
java -jar vectortool-1.0-SNAPSHOT.jar closest ../../model.opt
It is also possible to read it from your own Java/Scala program. See the Word2vec.java class for how that is done.