If you are working with text documents, one common approach is to create word embeddings, which represent words with similar meanings by similar vectors.
The code has been updated to work with TensorFlow 2. A fix for the remaining deprecation warning is coming soon.
In this Jupyter notebook I would like to show how you can create embeddings from scratch using gensim and visualize them on TensorBoard in a simple way.
Some time ago I tried gensim's built-in word2vec2tensor script to get the embeddings into TensorBoard, but without success. Therefore I implemented this version in combination with TensorFlow.
For this example I used a subset of 200,000 documents from the Yelp dataset. This is a great dataset that includes different languages, but mostly English reviews.
As you can see in my animation, it learns the representation of similar words from scratch. German and other languages are also included!
You can improve the results by tuning some parameters of word2vec, using t-SNE or modifying the preprocessing.
Because of the dataset license, I can publish neither my training data nor the trained embeddings. Feel free to use the notebook for your own dataset or request the data from Yelp.
Just put your text files in the directory defined by TEXT_DIR. Everything will be saved in the folder defined by MODEL_PATH.
Finally, start TensorBoard:
tensorboard --logdir emb_yelp/
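For reference: before TensorBoard can display anything, the embedding vectors and their word labels have to be written into that log directory. A minimal TensorFlow 2 sketch of this step could look as follows; the file names, the checkpoint layout and the use of emb_yelp/ as log directory are assumptions, and the notebook may differ in detail.

import os
import gensim
import tensorflow as tf
from tensorboard.plugins import projector

LOG_DIR = 'emb_yelp/'
model = gensim.models.Word2Vec.load(os.path.join(LOG_DIR, 'word2vec'))

# Write one vocabulary entry per line; TensorBoard uses this file as point labels
with open(os.path.join(LOG_DIR, 'metadata.tsv'), 'w', encoding='utf-8') as f:
    for word in model.wv.index2word:
        f.write(word + '\n')

# Store the embedding matrix as a checkpointed variable
weights = tf.Variable(model.wv.vectors)
checkpoint = tf.train.Checkpoint(embedding=weights)
checkpoint.save(os.path.join(LOG_DIR, 'embedding.ckpt'))

# Tell the projector plugin where to find the tensor and its labels
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = 'embedding/.ATTRIBUTES/VARIABLE_VALUE'
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(LOG_DIR, config)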
If you would like to use your own trained embeddings in a neural network, you can load the trained weights (vectors) into an embedding layer (e.g. in Keras). This is really useful, especially if you have only a few samples to train your network on. Another reason is that existing pre-trained models like Google word2vec or GloVe may not be sufficient, because they are not task-specific embeddings.
If you need an example of how to use trained gensim embeddings in Keras, take a look at the code snippet below. It is similar to this Jupyter notebook, where I used GloVe, but loading gensim weights is quite a bit different.
import gensim
import numpy as np
from tensorflow.keras.layers import Embedding

def get_embedding_weights(gensim_model, tokenizer, max_num_words, embedding_dim):
    """Build a weight matrix for a Keras embedding layer from a trained gensim model."""
    model = gensim.models.Word2Vec.load(gensim_model)
    embedding_matrix = np.zeros((max_num_words, embedding_dim))
    for word, i in tokenizer.word_index.items():
        # Copy the vector of every word that is both in the gensim vocabulary
        # and within the vocabulary limit of the Keras tokenizer
        if word in model.wv.vocab and i < max_num_words:
            embedding_vector = model.wv.vectors[model.wv.vocab[word].index]
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

emb_weights = get_embedding_weights(gensim_model='emb_yelp/word2vec',
                                    tokenizer=tokenizer,
                                    max_num_words=MAX_NUM_WORDS,
                                    embedding_dim=EMBEDDING_DIM)

embedding_layer = Embedding(input_dim=MAX_NUM_WORDS,
                            output_dim=EMBEDDING_DIM,
                            input_length=MAX_SEQ_LENGTH,
                            weights=[emb_weights],
                            trainable=False)  # keep the pre-trained vectors fixed
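The snippet assumes that MAX_NUM_WORDS, EMBEDDING_DIM, MAX_SEQ_LENGTH and a Keras Tokenizer fitted on your texts are already defined. Here is a hedged sketch of how the frozen layer could then be wired into a small classifier; the pooling layer and the sigmoid output are illustrative assumptions, not part of the original code.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling1D, Dense

# embedding_layer is the frozen, pre-trained layer defined above;
# it expects integer sequences of length MAX_SEQ_LENGTH from the fitted tokenizer.
model = Sequential([
    embedding_layer,
    GlobalAveragePooling1D(),
    Dense(1, activation='sigmoid')  # e.g. positive vs. negative review
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()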
Christopher Masch