Prometheus word2vec

This repository contains the code to generate the word2vec model for Prometheus.

Setup

Check out the word2vec submodule (the reference implementation) and then build it:

git submodule update --init
cd word2vec
make

Training

We use the following plaintext corpora:

  • sv - Vocab size: 1163288, Words in train file: 284410463
  • en - Vocab size: 4891175, Words in train file: 2989787812

To allow for an unknown-word vector, first create the vocabulary, manually append an unknown token to it, and then train the model:

./word2vec -train <input.txt> -save-vocab vocab.txt
echo "__UNKNOWN__ 0" >> vocab.txt
./word2vec -train <input.txt> -binary 1 -output <model.bin> -size 300 -window 5 -sample 1e-4 -negative 5 -hs 0 -cbow 1 -iter 3 -read-vocab vocab.txt -threads 4
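The appended token exists so that out-of-vocabulary words can still be mapped to a vector at query time. A minimal sketch of that fallback, assuming the trained vectors have already been loaded into an in-memory map (the Map layout and class below are hypothetical and not part of this repository):

import java.util.Map;

public class UnknownFallback {
    private static final String UNKNOWN = "__UNKNOWN__";

    // Hypothetical in-memory representation: word -> embedding.
    private final Map<String, float[]> vectors;

    public UnknownFallback(Map<String, float[]> vectors) {
        this.vectors = vectors;
    }

    // Returns the vector for a word, falling back to the
    // __UNKNOWN__ vector for out-of-vocabulary words.
    public float[] vectorFor(String word) {
        float[] v = vectors.get(word);
        return v != null ? v : vectors.get(UNKNOWN);
    }
}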

Produce Optimized Model

Thanks to Marcus Klang there is a way to create an extremely fast binary model. This model is read using memory mapping in Java at near I/O speed.

It is created from the text-format model file; to produce one, run the training command above with -binary 0 instead of -binary 1. Then convert it:

cd vectortool
mvn package
cd target
java -jar vectortool-1.0-SNAPSHOT.jar convert ../../model.txt model.opt

Once the optimized model is created, it can be queried using:

java -jar vectortool-1.0-SNAPSHOT.jar closest ../../model.opt

It is also possible to read the model from your own Java/Scala program; see the Word2vec.java class for how that is done.
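For orientation, memory-mapped access works roughly as sketched below. The actual record layout and byte order of model.opt are defined by vectortool (see Word2vec.java); the header fields and offsets here are placeholders, not the real format.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedModel {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("model.opt", "r");
             FileChannel channel = file.getChannel()) {
            // Map the whole file; the OS pages it in lazily, which is what
            // gives the near I/O-speed access mentioned above.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            // Placeholder header: the real format is defined by vectortool.
            int vocabSize = buf.getInt();
            int dimensions = buf.getInt();

            // Read one vector of floats directly from the mapped buffer.
            float[] vector = new float[dimensions];
            for (int i = 0; i < dimensions; i++) {
                vector[i] = buf.getFloat();
            }
            System.out.printf("vocab=%d dim=%d first[0]=%f%n", vocabSize, dimensions, vector[0]);
        }
    }
}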
