If you are working with text documents, one common approach is to create word embeddings, which represent words with similar meanings by similar vectors.
The code has been updated to work with TensorFlow 2. A fix for the remaining deprecation warning is coming soon.
In this Jupyter notebook I would like to show how you can create embeddings from scratch using gensim and visualize them on TensorBoard in a simple way.
Some time ago I tried gensim's built-in word2vec2tensor script to get the embeddings into TensorBoard, but without success. Therefore I implemented this version in combination with TensorFlow.
For this example I used a subset of 200,000 documents from the Yelp dataset. This is a great dataset that includes different languages, but mostly English reviews.
As you can see in my animation, it learns the representation of similar words from scratch. German and other languages are also included!
You can improve the results by tuning some parameters of word2vec, using t-SNE or modifying the preprocessing.
Because of the dataset license, I can publish neither my training data nor the trained embeddings. Feel free to use the notebook for your own dataset or request the data from Yelp.
Just put your text files in the directory defined by TEXT_DIR. Everything will be saved in the folder defined by MODEL_PATH.
Finally, start TensorBoard:
tensorboard --logdir emb_yelp/
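For reference: before TensorBoard can display anything, the embedding vectors and their word labels have to be written into that log directory. A minimal TensorFlow 2 sketch of this step could look as follows; the file names, the checkpoint layout and the use of emb_yelp/ as log directory are assumptions, and the notebook may differ in detail.

import os
import gensim
import tensorflow as tf
from tensorboard.plugins import projector

LOG_DIR = 'emb_yelp/'
model = gensim.models.Word2Vec.load(os.path.join(LOG_DIR, 'word2vec'))

# Write one vocabulary entry per line; TensorBoard uses this file as point labels
with open(os.path.join(LOG_DIR, 'metadata.tsv'), 'w', encoding='utf-8') as f:
    for word in model.wv.index2word:
        f.write(word + '\n')

# Store the embedding matrix as a checkpointed variable
weights = tf.Variable(model.wv.vectors)
checkpoint = tf.train.Checkpoint(embedding=weights)
checkpoint.save(os.path.join(LOG_DIR, 'embedding.ckpt'))

# Tell the projector plugin where to find the tensor and its labels
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = 'embedding/.ATTRIBUTES/VARIABLE_VALUE'
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(LOG_DIR, config)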
If you would like to use your own trained embeddings in a neural network, you can load the trained weights (vectors) into an embedding layer (e.g. in Keras). This is really useful, especially if you have only a few samples to train your network on. Another reason is that existing pre-trained models like Google word2vec or GloVe may not be sufficient, because they are not task-specific embeddings.
If you need an example of how to use trained gensim embeddings in Keras, take a look at the code snippet below. It is similar to this Jupyter notebook, where I used GloVe, but loading gensim weights is quite a bit different.
import gensim
import numpy as np
from tensorflow.keras.layers import Embedding

def get_embedding_weights(gensim_model, tokenizer, max_num_words, embedding_dim):
    """Build a weight matrix for a Keras embedding layer from a trained gensim model."""
    model = gensim.models.Word2Vec.load(gensim_model)
    embedding_matrix = np.zeros((max_num_words, embedding_dim))
    for word, i in tokenizer.word_index.items():
        # Copy the vector of every word that is both in the gensim vocabulary
        # and within the vocabulary limit of the Keras tokenizer
        if word in model.wv.vocab and i < max_num_words:
            embedding_vector = model.wv.vectors[model.wv.vocab[word].index]
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

emb_weights = get_embedding_weights(gensim_model='emb_yelp/word2vec',
                                    tokenizer=tokenizer,
                                    max_num_words=MAX_NUM_WORDS,
                                    embedding_dim=EMBEDDING_DIM)

embedding_layer = Embedding(input_dim=MAX_NUM_WORDS,
                            output_dim=EMBEDDING_DIM,
                            input_length=MAX_SEQ_LENGTH,
                            weights=[emb_weights],
                            trainable=False)  # keep the pre-trained vectors fixed
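The snippet assumes that MAX_NUM_WORDS, EMBEDDING_DIM, MAX_SEQ_LENGTH and a Keras Tokenizer fitted on your texts are already defined. Here is a hedged sketch of how the frozen layer could then be wired into a small classifier; the pooling layer and the sigmoid output are illustrative assumptions, not part of the original code.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling1D, Dense

# embedding_layer is the frozen, pre-trained layer defined above;
# it expects integer sequences of length MAX_SEQ_LENGTH from the fitted tokenizer.
model = Sequential([
    embedding_layer,
    GlobalAveragePooling1D(),
    Dense(1, activation='sigmoid')  # e.g. positive vs. negative review
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()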
Christopher Masch