title | author | date |
---|---|---|
Word Embeddings (with possible pre-workshop typos) | Dave Campbell | 2020-06-04, Part 2.3 |
library(devtools)                         # provides install_github()
library(tm)                               # general text mining utilities
install_github("bmschmidt/wordVectors")   # yup install from GitHub
library(wordVectors)
The goal is to convert words into vectors of numbers such that math on the vectors makes sense as math on the words. The classic examples are things like king - man + woman = queen, or Paris - France + Canada = Ottawa.
The concept is to use a model to predict a word from those around it (or the inverse: predict surrounding words from a central word). The process converts one-hot (dummy variable) encoded word vectors into a low-dimensional subspace called the embedding space. This low-dimensional embedding space allows us to do this sort of vector math on words.
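As a toy sketch of that encoding step (the tiny vocabulary and the random weight matrix below are made up for illustration, not part of any fitted model), multiplying a one-hot vector by a weight matrix simply picks out one row of that matrix, and that row is the word's embedding:
vocab = c("king", "queen", "man", "woman", "ottawa")
one_hot = function(word, vocab) as.numeric(vocab == word)    # a 1 in the word's slot, 0 elsewhere
one_hot("queen", vocab)                                      # 0 1 0 0 0
set.seed(1)
W = matrix(rnorm(length(vocab) * 3), nrow = length(vocab))   # toy 5 x 3 embedding matrix
one_hot("queen", vocab) %*% W                                # picks out row 2 of W: queen's 3-d embedding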
The original paper included two strategies. Borrowing a figure from that paper:
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013) "Efficient Estimation of Word Representations in Vector Space". ICLR
For a vocabulary of size $V$, the Continuous Bag of Words (CBOW) model takes the one-hot encoded vectors of the surrounding context words, each of length $V$, and predicts the central word. The Skip-gram model is the opposite, taking a single one-hot encoded vector of length $V$ for the central word and predicting the surrounding context words.
Borrowing a figure from the second paper from the same group in 2013, vector differences in the hidden layer seem to represent the right kind of behaviour when comparing many countries with their capitals.
Mikolov et al. (2013) "Distributed Representations of Words and Phrases and their Compositionality", NeurIPS.
The hidden layer embedding of dimension $N$, much smaller than the vocabulary size, gives each word a dense numeric vector. Closeness of two embedded words $A$ and $B$ is measured by the cosine of the angle between them,
[cos(\theta)=\frac{\sum_{i=1}^{N}A_iB_i}{\sqrt{\sum_{i=1}^NA^2_i}\sqrt{\sum_{i=1}^NB^2_i}}.] Closeness scores are 1 if the words are exactly the same, 0 if orthogonal, and -1 if exactly opposite. One problem is that sometimes antonyms are used in similar ways which would give them high similarity. For example "I love this kale, it's delicious!" vs "I love this donut, it's delicious!"
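The formula translates directly into a few lines of R; the two example vectors below are arbitrary stand-ins for embedded words:
cosine_similarity = function(A, B) sum(A * B) / (sqrt(sum(A^2)) * sqrt(sum(B^2)))
a = c(0.2, -1.0, 0.5)          # arbitrary stand-in for one embedded word
b = c(0.1, -0.9, 0.7)          # and another
cosine_similarity(a, b)        # near 1: similar directions
cosine_similarity(a, -a)       # exactly -1: opposite directions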
To be honest, model fitting is slow for a large vocabulary and large text corpus. Considering the skip-gram architecture for clarity, we typically train such a model by maximizing the average log probability of the observed output words in a window of size $c$ around each central word, [\frac{1}{T}\sum_{t=1}^{T}\sum_{-c\le j\le c,\, j\neq 0}\log p(w_{t+j}\mid w_t),] over a training corpus of $T$ words.
The original paper published their fitted model, but the model is ~3.64GB, and takes several minutes to load into R.
Consequently we will use a much smaller corpus and fit our own model. I haven't seen much of a difference between CBOW and skip-gram in practice.
The workhorse library wordVectors is a wrapper to the Google code. The main function:
- train_word2vec(text_to_model, name_of_output_file, vectors, threads, cbow, window, iter, negative_samples)
- vectors: the embedding dimension
- threads: number of CPU cores to use
- cbow: 1 for CBOW, 0 for skip-gram
- window: window width
- iter: number of times the algorithm passes through the entire corpus
- negative_samples: this is complex, see below
When fitting the model, the training must pass over the entire corpus, updating the parts of the neural network associated with each observed window of data. The vast majority of the vocabulary is obviously not contained in a short window and is considered a negative (not present) observation. You could update the parts of the neural net for the entire vocabulary at every observed word, or you could just update a random sample of a small number of vocabulary elements. This number of words to update is the negative_samples input. Typically for a very large corpus we keep this around 2 to 5, since there are many other opportunities to update the model and larger numbers take longer to run, but for a smaller corpus we often increase it to something like 10 or 15. Results are generally quite robust to this number. On a related note, with a large corpus we typically use a smaller number of iterations, since each pass through the corpus already provides a lot of training examples.
As an example we will use a small corpus, the cookbooks example from vignette("introduction", "wordVectors"). The corpus includes 76 texts from the Michigan State University library, spanning the late 18th to early 20th century. The corpus is a set of plain text files.
If you need to download the cookbooks do so here. Unzip the file and move it to your working directory.
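If you prefer to script those steps, here is a minimal sketch; the URL below is a placeholder for the link above, so substitute the real address and your own paths:
cookbooks_url = "https://example.com/cookbooks.zip"        # placeholder: use the link referenced above
download.file(cookbooks_url, destfile = "cookbooks.zip")
unzip("cookbooks.zip")                                     # then move the contents to your working directory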
# takes me about 2:15 to run
T1 = Sys.time()
model = train_word2vec("content/post/cookbooks.txt","content/post/cookbook_vectors.bin",
vectors=200,threads=4,cbow=1,window=12,iter=5,negative_samples=7)
T2 = Sys.time()
(Elapsed = T2-T1)
Words are used in complex ways, and a word can change meaning drastically when it appears as part of a pair, for example Parliament vs Parliament Hill. In the cookbook data we have compounds like white_wine, lyonnaise_potatoes, and string_beans. You can automatically produce such groupings by skimming documents to find word pairs that occur frequently and then 'gluing' them together. The function wordVectors::word2phrase lets you set thresholds for how frequently words must co-occur before they are joined. It is impossible to separate data cleaning from data analysis: here the phrasing also created beasts like coffee_dinner and coffee_luncheon.
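A hedged sketch of that phrasing step (the input and output file names are illustrative, and the argument names follow my reading of the package help; check ?word2phrase before relying on them):
word2phrase("content/post/cookbooks_plain.txt",      # hypothetical un-phrased input file
            output_file = "content/post/cookbooks.txt",
            min_count = 10,                          # ignore word pairs seen fewer than 10 times
            threshold = 50)                          # lower values join word pairs more aggressively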
It might be of interest to see what words the embedding space places close to others. Some examples are very successful and may suggest alternative ingredients.
model %>% closest_to("fish")
model %>% closest_to("peach")
model %>% closest_to("cinnamon")
model %>% closest_to("sweet")
model %>% closest_to("coffee_luncheon")
model %>% closest_to("coffee_dinner")
model %>% closest_to("squirrel")
model %>% closest_to("woman")
model %>% closest_to("man")
Fish is to dumpling as apple is to? These analogies are a standard test for the quality of a Word2Vec model.
model %>% closest_to(~"dumpling"-"fish"+"apple",15)
model %>% closest_to("dumpling",15)
However, the data input is not exactly representative of the population of use cases for a word. The sampling bias of building a model based on news articles becomes apparent when the Google model is run (large file takes a long time to load, see above for download).
#load in the google pre-trained model:
# takes about 10 minutes to load
model = read.vectors("content/post/GoogleNews-vectors-negative300.bin")
model
#A VectorSpaceModel object of 3000000 words and 300 vectors
# examine the model for Paris is to France as Tokyo is to?
#runtime is ~ 20 seconds
model %>% closest_to(~ "france" - "paris" + "tokyo")
#(often people speed up the model by removing some of the sparse dimensions...)
# King vector - Man vector + Woman vector is close to what?
#runtime is ~ 20 seconds
model %>% closest_to(~ "king" - "man" + "woman")
# Man is to Computer Programmer as Woman is to?
#runtime is ~ 20 seconds
#i.e. Computer Programmer vector - Man vector + Woman vector is close to what?
model %>% closest_to(~ "computer_programmer" - "man" + "woman")
The problem with complex models is that their implications are mysterious and difficult to assess. The gendering of words took 3 years to bring out into the open:
Bolukbasi et al (2016) "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" NeurIPS
The data used to fit the Google model was a convenience sample and is not representative of the population of ways that words are or should be used.
The problem of removing inappropriate bias is not completely solved, but that group worked on post-processing to rotate dimensions that should not be gendered. This idea of post-processing has also been used to improve matching with conventional analogies. Several new versions of Word2Vec are optimized by matching word probabilities with a penalty placed on deviations from some human-determined analogies.
The embeddings could be normalized, in which case the vectors take values on the unit hyper-sphere. It is tempting to consider a document as a random sample of words. Replacing the words by their embedding vectors projects, in the case of the Google model, 3,000,000 words onto a 300-dimensional sphere.
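A minimal sketch of that normalization, assuming the fitted model can be coerced to a plain numeric matrix with one row per word (variable names are mine):
embedding = as.matrix(model)                                           # rows are words, columns are embedding dimensions
unit_embedding = sweep(embedding, 1, sqrt(rowSums(embedding^2)), "/")  # rescale each row to unit length
range(rowSums(unit_embedding^2))                                       # sanity check: everything is (numerically) 1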
We might want to then compare 2 documents, as though they were two samples and test the null hypothesis that they are samples from the same distribution.
The von Mises-Fisher distribution is used to describe samples on the unit sphere. Generally it is used for directional data in 2 or 3 dimensions, but the density is defined for any dimension $p$:
[f_p(x;\mu,\kappa)=\frac{\kappa^{p/2-1}}{(2\pi)^{p/2}I_{p/2-1}(\kappa)}\exp(\kappa\mu^Tx).]
The parameter $\mu$ is the mean direction (a unit vector), $\kappa \ge 0$ is the concentration around that direction, and $I_{p/2-1}$ is a modified Bessel function of the first kind.
The maximum likelihood estimate of the mean direction is the vector sum of all of the observations divided by the norm of that sum,
[\hat{\mu}= \frac{\sum_{i=1}^Nx_i}{\mid\mid \sum_{i=1}^Nx_i \mid\mid}.]
For large $p$, an approximate maximum likelihood estimate of the concentration parameter is [\hat{\kappa}\approx\frac{\bar{r}(p-\bar{r}^2)}{1-\bar{r}^2},] where the mean resultant length is [\bar{r} = \frac{\mid\mid\sum_{i=1}^N x_i\mid\mid}{N}.]
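Both estimators are short enough to code directly; a sketch, using rows of the normalized embedding from above as the observations $x_i$ (the function name is mine):
vmf_mle = function(X){                       # X: N x p matrix of unit vectors, one observation per row
  resultant = colSums(X)                     # vector sum of the observations
  R = sqrt(sum(resultant^2))                 # its length
  mu_hat = resultant / R                     # estimated mean direction
  r_bar = R / nrow(X)                        # mean resultant length
  kappa_hat = r_bar * (ncol(X) - r_bar^2) / (1 - r_bar^2)   # approximate concentration
  list(mu = mu_hat, kappa = kappa_hat)
}
# e.g. treat the embeddings of a few food words as one sample:
# vmf_mle(unit_embedding[c("fish", "dumpling", "apple"), ])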
Hypothesis testing is then ANOVA-ish and scales up to samples from $k$ populations. For $k$ populations with sample sizes $n_1,\dots,n_k$ and $n=\sum_{i=1}^k n_i$ observations in total, let $R_i$ be the resultant length (the norm of the vector sum) of sample $i$ and $R$ the resultant length of the combined sample. The test statistic is
[W = \frac {(n-k)\left(\sum_{i=1}^kR_i-R\right)} {(k-1)(n-\sum_{i=1}^kR_i)} \sim F_{(k-1)(p-1), (n-k)(p-1)}]
In the Google case $p$ was 300. Even with a small sample of words from each document, the degrees of freedom of the $F$ distribution are therefore enormous.
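Coding the test statistic above directly (a sketch under the same assumptions as the earlier snippets; the resultant lengths are norms of within-sample and pooled vector sums, and the function names are mine):
resultant_length = function(X) sqrt(sum(colSums(X)^2))   # norm of the vector sum of the rows

sphere_anova = function(samples){                 # samples: list of n_i x p matrices of unit vectors
  k = length(samples)
  p = ncol(samples[[1]])
  n = sum(sapply(samples, nrow))
  Ri = sapply(samples, resultant_length)          # within-sample resultant lengths
  R = resultant_length(do.call(rbind, samples))   # resultant length of the pooled sample
  W = ((n - k) * (sum(Ri) - R)) / ((k - 1) * (n - sum(Ri)))
  df1 = (k - 1) * (p - 1)
  df2 = (n - k) * (p - 1)
  c(W = W, p_value = pf(W, df1, df2, lower.tail = FALSE))
}
# e.g. compare two documents' word embeddings:
# sphere_anova(list(unit_embedding[words_from_doc1, ], unit_embedding[words_from_doc2, ]))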
Hypothesis testing remains an open problem, but could be useful for assessing plagiarism of ideas or for author attribution models.
Using Word2Vec to discover new materials:
Tshitoyan et al (2019) "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature 571, 95-98
A generalization of Word2Vec:
Pennington et al. (2014) "GloVe: Global Vectors for Word Representation" Empirical Methods in Natural Language Processing (EMNLP)
Excellent book on circular statistics. See Chapter 5 for more options for test statistics on the sphere, but unfortunately it wasn't designed for our massive dimensions:
Jammalamadaka, S. Rao., and Ambar Sengupta. (2001) "Topics in Circular Statistics". River Edge, N.J: World Scientific