This project was done during the course "Éléments logiciels pour le traitement des données massives" given by Xavier Dupré and Matthieu Durut at ENSAE ParisTech.
Final grade: 18/20
The idea is to implement a naïve version of the Word2Vec algorithm developed by Mikolov et al. (using the continuous Skip-gram model) in Python with the NumPy library, and to compare this naïve version to a faster and more scalable one inspired by the work of Ji et al. in their paper "Parallelizing Word2Vec in Shared and Distributed Memory". That implementation is based on PyTorch, and we compare the performance of the two algorithms in terms of training speed, inference time, parallelization scheme, number of threads used, etc.
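To make the comparison concrete, here is a minimal NumPy sketch of one Skip-gram update with negative sampling. It only illustrates the technique and is not the project's actual code; the hyperparameters and the use of negative sampling (rather than, e.g., a full softmax) are assumptions.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, lr = 1000, 100, 0.025   # arbitrary illustrative values

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # "input" (center word) embeddings
W_out = np.zeros((vocab_size, dim))                     # "output" (context word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context, negatives):
    # One SGD update for a (center, context) pair plus sampled negative words.
    h = W_in[center]                                    # (dim,)
    targets = np.concatenate(([context], negatives))    # positive first, then negatives
    labels = np.zeros(len(targets))
    labels[0] = 1.0
    scores = sigmoid(W_out[targets] @ h)                # (1 + n_neg,)
    grad = scores - labels                              # gradient of the log-loss w.r.t. the scores
    W_in[center] -= lr * (grad @ W_out[targets])        # update the center-word embedding
    W_out[targets] -= lr * np.outer(grad, h)            # update context/negative embeddings

# Example: center word 5, true context word 42, 5 random negative samples.
train_pair(5, 42, rng.integers(0, vocab_size, size=5))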
All evaluations will be run on the same machine with 8 cores/16 threads (Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz), 32GB of RAM and one NVIDIA GeForce RTX 3080 (10GB VRAM).
The code was run and tested in Python 3.8.
pip install -r requirements.txt
This will install the CPU version of PyTorch. To get a CUDA-enabled version, follow the official documentation, which can be found here, to install the version compatible with your hardware.
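Whatever install route you take, a quick sanity check (independent of the exact command used) is to verify that PyTorch sees the GPU:

import torch
print(torch.__version__)
print(torch.cuda.is_available())  # True only with a CUDA-enabled build and a visible GPU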
To download the NLTK resources needed by our Word2Vec preprocessing (stopwords), run:
python3 -m nltk.downloader stopwords
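Once downloaded, the stopword list can be loaded with standard NLTK calls; the exact filtering done in this project may differ, but the usage looks like:

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = "the quick brown fox jumps over the lazy dog".split()
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']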
For this project, we used the same data as in "Parallelizing Word2Vec in Shared and Distributed Memory". One can retrieve them by running:
mkdir data && cd data
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
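text8 is a single plain-text file of roughly 17 million lowercase, space-separated words, so it can be loaded into a token list in one read (a minimal sketch; the project's own preprocessing may differ):

# Read the text8 corpus produced by the commands above into a list of tokens.
# Adjust the path if you are not at the repository root.
with open("data/text8", "r") as f:
    tokens = f.read().split()

print(len(tokens))   # number of word occurrences in the corpus
print(tokens[:10])   # first few tokens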
and, for the One Billion Word benchmark:
cd data
wget http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
tar xvzf 1-billion-word-language-modeling-benchmark-r13output.tar.gz
cat 1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100 > 1b
cat 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/* >> 1b
rm -rf 1-billion-word-language-modeling-benchmark-r13output
The commands above are taken from Ji et al.'s work and can be found in the corresponding repository.
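Typical Word2Vec preprocessing then builds a vocabulary with word frequencies over the corpus (used for subsampling frequent words and for the negative-sampling distribution). A streaming sketch over the concatenated 1b file could look like this; the min_count threshold of 5 and the path are assumptions, not values taken from the project:

from collections import Counter

counts = Counter()
with open("data/1b", "r") as f:          # adjust the path if needed
    for line in f:
        counts.update(line.split())

min_count = 5                            # assumed frequency cutoff
vocab = {w for w, c in counts.items() if c >= min_count}
print(f"{len(vocab)} words kept out of {len(counts)}")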
It contains all the source code, divided into 3 main sub-folders:
common: source code shared between the NumPy implementation and the PyTorch one.
word2vec_accelerated: PyTorch version of the models.
word2vec_numpy: NumPy version of the models.
Contains notebooks illustrating model training and evaluation.
Contains the source code related to the training speed evaluation. The sub-folder results contains the results of these training speed tests.
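As an illustration of how thread-count sensitivity can be measured with PyTorch (the workload below is a stand-in resembling Skip-gram lookups and scoring, not the project's actual training loop):

import time
import torch

def dummy_workload(n_iters=200, vocab_size=50000, dim=100, batch=512):
    emb = torch.randn(vocab_size, dim)
    out = torch.randn(vocab_size, dim)
    for _ in range(n_iters):
        idx = torch.randint(0, vocab_size, (batch,))
        h = emb[idx]              # center-word embeddings
        scores = h @ out.t()      # scores against the whole vocabulary
        scores.sigmoid_()

for n_threads in (1, 2, 4, 8, 16):
    torch.set_num_threads(n_threads)   # limit intra-op CPU threads
    start = time.perf_counter()
    dummy_workload()
    print(f"{n_threads:2d} threads: {time.perf_counter() - start:.2f}s")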
"Parallelizing Word2Vec in Shared and Distributed Memory" by Ji. et al.
Their code has be found here.
Distributed Representations of Words and Phrases and their Compositionality by Mikolov et al.
Efficient Estimation of Words Representations in Vector Space by Mikolov et al.