Widget for exploring and sampling words from text data through word2vec models in order to construct topic dictionaries.
Use conda
to create a new environment and install the requirements
conda create -n w2widget python=3.9
conda activate w2widget
pip install -e .
In the widget_example.ipynb
you can play with the widget from pretrained data from Reuters dataset.
If you want to see an example of the data-workflow generating the necessary input, check out workflow_example.ipynb
.
This module helps with calculating and handling doc2vec. The approach applied is that every document's vector is calculated by taking a weighted (ie. based on inverse frequencies) average of the document's word vectors.
Note
I recommend using a sentence-embedding model instead of the following approach.
from w2widget.doc2vec import calculate_inverse_frequency, Doc2Vec
# Calculate word weigts from inverse frequency
word_weights = calculate_inverse_frequency(document_tokens)
# Initiate the model
dv_model = Doc2Vec(wv_model, word_weights)
# Add documents and calculated the document vectors
dv_model.add_doc2vec(document_tokens)
# reduce the dimensions
dv_model.reduce_dimensions()
# Store the embeddings
two_dim_doc_embedding = dv_model.TSNE_embedding_array
This widget module displays the results from:
- A gensim word2vec model,
- it's 2-dimensional embedding (ie. TSNE).
- The custom implemented doc2vec model,
- it's 2-dimensional embedding (ie. TSNE).
- A list of tokenized documents with whitespaces and
- optionally a list of initial search words
from w2widget.widget import Widget
wv_widget = Widget(
wv_model,
two_dim_word_embedding,
tokens_with_ws
dv_model=None,
two_dim_doc_embedding=None,
initial_search_words=[],
)
wv_widget.display_widget()
You can save the topics to a json
file from the widget, or access them from the dictionary stored in wv_widget.topics
.