This project was done during the course "Éléments logiciels pour le traitement des données massives" given by Xavier Dupré and Matthieu Durut at ENSAE ParisTech.
Final grade: 18/20
The idea is to implement a naïve version of the Word2Vec algorithm developed by Mikolov et al. (using the continuous Skip-gram model) in Python with the NumPy library, and to compare this naïve version to a faster and more scalable one inspired by the work of Ji et al. in their paper "Parallelizing Word2Vec in Shared and Distributed Memory". That implementation is based on PyTorch, and we compare the performance of the two algorithms in terms of training speed, inference time, parallelization scheme, number of threads used, etc.
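To make the comparison concrete, here is a minimal NumPy sketch of one Skip-gram update with negative sampling. It only illustrates the technique and is not the project's actual code; the hyperparameters and the use of negative sampling (rather than, e.g., a full softmax) are assumptions.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, lr = 1000, 100, 0.025   # arbitrary illustrative values

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # "input" (center word) embeddings
W_out = np.zeros((vocab_size, dim))                     # "output" (context word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context, negatives):
    # One SGD update for a (center, context) pair plus sampled negative words.
    h = W_in[center]                                    # (dim,)
    targets = np.concatenate(([context], negatives))    # positive first, then negatives
    labels = np.zeros(len(targets))
    labels[0] = 1.0
    scores = sigmoid(W_out[targets] @ h)                # (1 + n_neg,)
    grad = scores - labels                              # gradient of the log-loss w.r.t. the scores
    W_in[center] -= lr * (grad @ W_out[targets])        # update the center-word embedding
    W_out[targets] -= lr * np.outer(grad, h)            # update context/negative embeddings

# Example: center word 5, true context word 42, 5 random negative samples.
train_pair(5, 42, rng.integers(0, vocab_size, size=5))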
All evaluations will be run on the same machine with 8 cores/16 threads (Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz), 32GB of RAM and one NVIDIA GeForce RTX 3080 (10GB VRAM).
The code was run and tested in Python 3.8.
pip install -r requirements.txt
This will install the CPU version of PyTorch. To get a CUDA-enabled version, follow the official documentation, which can be found here, to install the version compatible with your hardware.
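Whatever install route you take, a quick sanity check (independent of the exact command used) is to verify that PyTorch sees the GPU:

import torch
print(torch.__version__)
print(torch.cuda.is_available())  # True only with a CUDA-enabled build and a visible GPU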
To download the NLTK resources needed by our Word2Vec preprocessing (stopwords), run:
python3 -m nltk.downloader stopwords
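Once downloaded, the stopword list can be loaded with standard NLTK calls; the exact filtering done in this project may differ, but the usage looks like:

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = "the quick brown fox jumps over the lazy dog".split()
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']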
For this project, we used the same data as in "Parallelizing Word2Vec in Shared and Distributed Memory". One can retrieve them by running:
mkdir data && cd data
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
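text8 is a single plain-text file of roughly 17 million lowercase, space-separated words, so it can be loaded into a token list in one read (a minimal sketch; the project's own preprocessing may differ):

# Read the text8 corpus produced by the commands above into a list of tokens.
# Adjust the path if you are not at the repository root.
with open("data/text8", "r") as f:
    tokens = f.read().split()

print(len(tokens))   # number of word occurrences in the corpus
print(tokens[:10])   # first few tokens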
and, for the One Billion Word benchmark:
cd data
wget http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
tar xvzf 1-billion-word-language-modeling-benchmark-r13output.tar.gz
cat 1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100 > 1b
cat 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/* >> 1b
rm -rf 1-billion-word-language-modeling-benchmark-r13output
The commands above are taken from Ji et al.'s work and can be found in the corresponding repository.
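Typical Word2Vec preprocessing then builds a vocabulary with word frequencies over the corpus (used for subsampling frequent words and for the negative-sampling distribution). A streaming sketch over the concatenated 1b file could look like this; the min_count threshold of 5 and the path are assumptions, not values taken from the project:

from collections import Counter

counts = Counter()
with open("data/1b", "r") as f:          # adjust the path if needed
    for line in f:
        counts.update(line.split())

min_count = 5                            # assumed frequency cutoff
vocab = {w for w, c in counts.items() if c >= min_count}
print(f"{len(vocab)} words kept out of {len(counts)}")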
It contains all the source code, divided into 3 main sub-folders:
common: source code shared between the NumPy implementation and the PyTorch one.
word2vec_accelerated: PyTorch version of the models.
word2vec_numpy: NumPy version of the models.
Contains notebooks illustrating model training and evaluation.
Contains the source code related to the training speed evaluation. The sub-folder results contains the results of these training speed tests.
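As an illustration of how thread-count sensitivity can be measured with PyTorch (the workload below is a stand-in resembling Skip-gram lookups and scoring, not the project's actual training loop):

import time
import torch

def dummy_workload(n_iters=200, vocab_size=50000, dim=100, batch=512):
    emb = torch.randn(vocab_size, dim)
    out = torch.randn(vocab_size, dim)
    for _ in range(n_iters):
        idx = torch.randint(0, vocab_size, (batch,))
        h = emb[idx]              # center-word embeddings
        scores = h @ out.t()      # scores against the whole vocabulary
        scores.sigmoid_()

for n_threads in (1, 2, 4, 8, 16):
    torch.set_num_threads(n_threads)   # limit intra-op CPU threads
    start = time.perf_counter()
    dummy_workload()
    print(f"{n_threads:2d} threads: {time.perf_counter() - start:.2f}s")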
"Parallelizing Word2Vec in Shared and Distributed Memory" by Ji. et al.
Their code has be found here.
Distributed Representations of Words and Phrases and their Compositionality by Mikolov et al.
Efficient Estimation of Words Representations in Vector Space by Mikolov et al.