Parallelizing Word2Vec

This project was carried out during the course "Éléments logiciels pour le traitement des données massives" (Software Elements for Massive Data Processing) taught by Xavier Dupré and Matthieu Durut at ENSAE ParisTech.

Final grade: 18/20

About this project

The idea is to implement a naïve version of the Word2Vec algorithm (continuous Skip-gram) developed by Mikolov et al., in Python using the NumPy library, and to compare it with a faster and more scalable version inspired by Ji et al.'s paper "Parallelizing Word2Vec in Shared and Distributed Memory". The latter implementation is based on PyTorch. We compare the performance of the two algorithms in terms of training speed, inference time, parallelization scheme, number of threads used, etc.
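
For intuition only, a minimal Skip-gram model with negative sampling might be sketched in PyTorch as below. This is our own illustrative snippet (class and variable names are not those of the repository), not the project's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    # Minimal Skip-gram with negative sampling, for illustration only.
    def __init__(self, vocab_size: int, embedding_dim: int = 100):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embedding_dim)   # center-word vectors
        self.out_embed = nn.Embedding(vocab_size, embedding_dim)  # context-word vectors

    def forward(self, center, context, negatives):
        # center: (B,), context: (B,), negatives: (B, K) of word indices
        v = self.in_embed(center)                                  # (B, D)
        u_pos = self.out_embed(context)                            # (B, D)
        u_neg = self.out_embed(negatives)                          # (B, K, D)
        pos_score = (v * u_pos).sum(dim=-1)                        # (B,)
        neg_score = torch.bmm(u_neg, v.unsqueeze(-1)).squeeze(-1)  # (B, K)
        # Negative-sampling objective from Mikolov et al. (2013)
        return -(F.logsigmoid(pos_score).mean()
                 + F.logsigmoid(-neg_score).sum(dim=-1).mean())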

All evaluations will be run on the same machine with 8 cores/16 threads (Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz), 32GB of RAM and one NVIDIA GeForce RTX 3080 (10GB VRAM).
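
Because the comparison also varies the number of CPU threads, note that PyTorch exposes intra-op parallelism directly; a quick way to pin it (generic illustration, not a project script):

import torch

torch.set_num_threads(8)        # limit intra-op parallelism to 8 CPU threads
print(torch.get_num_threads())  # verify the setting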

Installation

The code was run and tested in Python 3.8.

pip install -r requirements.txt

This installs the CPU version of PyTorch. To get a CUDA-enabled build, follow the official PyTorch installation documentation and install the version compatible with your hardware.
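
A quick sanity check (not part of the project scripts) to confirm which build is active:

import torch

# True only with a CUDA-enabled PyTorch build and a compatible GPU/driver.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))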

To download the NLTK data needed by our Word2Vec preprocessing, run:

python3 -m nltk.downloader stopwords
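
This fetches the NLTK stopwords corpus used during preprocessing. For illustration only (the actual preprocessing code lives in the repository and may differ), filtering stopwords with NLTK looks roughly like this:

from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))
tokens = "the quick brown fox jumps over the lazy dog".split()
filtered = [t for t in tokens if t not in STOP_WORDS]
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']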

Data

For this project, we used the same data as in "Parallelizing Word2Vec in Shared and Distributed Memory". One can retrieve them by running:

mkdir data && cd data
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f

and

cd data
wget http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
tar xvzf 1-billion-word-language-modeling-benchmark-r13output.tar.gz
cat 1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100 > 1b
cat 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/* >> 1b
rm -rf 1-billion-word-language-modeling-benchmark-r13output

The commands above are taken from Ji et al.'s work and can be found in their repository.
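
For reference, text8 is a single line of lowercase, space-separated tokens. A minimal way to inspect it once downloaded (assuming the data/text8 path produced by the commands above):

from collections import Counter
from pathlib import Path

tokens = Path("data/text8").read_text().split()  # one long line of words
print(f"{len(tokens):,} tokens, {len(set(tokens)):,} distinct words")
print(Counter(tokens).most_common(5))            # most frequent words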

About the structure of the project

word2vec_eltdm folder:

It contains all the source code, divided into three main sub-folders:

  • common: code shared between the NumPy and PyTorch implementations
  • word2vec_accelerated: PyTorch version of the models
  • word2vec_numpy: NumPy version of the models

notebooks folder:

Contains notebooks illustrating model training and evaluation.

speed_tests folder:

Contains the source code related to training speed evaluation. The results sub-folder stores the outputs of these speed tests.
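
As a generic illustration of the kind of measurement done there (the function names below are placeholders, not the repository's API):

import time

def benchmark(train_fn, repeats: int = 3) -> float:
    # Return the best wall-clock time in seconds over several runs.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        train_fn()  # e.g. one training epoch of the NumPy or PyTorch model
        timings.append(time.perf_counter() - start)
    return min(timings)

# Hypothetical usage:
# numpy_time = benchmark(lambda: train_numpy_epoch(model, data))
# torch_time = benchmark(lambda: train_torch_epoch(model, data))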

References

"Parallelizing Word2Vec in Shared and Distributed Memory" by Ji. et al.

Their code has be found here.

Distributed Representations of Words and Phrases and their Compositionality by Mikolov et al.

Efficient Estimation of Words Representations in Vector Space by Mikolov et al.
