---
title: "Introduction to R mallet"
author: "David Mimno"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{mallet}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
## Installation
The ```mallet``` R package is available on CRAN. To install, simply use ```install.packages()```.
```{r, eval=FALSE}
install.packages("mallet")
```
To load the package, simply use ```library()```.
```{r}
library(mallet)
```
## Usage
We start out by using the example data from the ```tm``` package.
```{r}
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
reuters <- VCorpus(DirSource(reut21578), readerControl = list(reader = readReut21578XMLasPlain))
reuters_text_vector <- unlist(lapply(reuters, as.character))
```
We can also use the stopword file from the ```tm``` package.
```{r}
stopwords_en <- system.file("stopwords/english.dat", package = "tm")
```
Create a mallet instance list object. Right now I have to specify the stoplist as a file; I can't pass in a list from R.
This function has a few hidden options (whether to lowercase, how we define a token). See ```?mallet.import``` for details.
```{r}
mallet.instances <- mallet.import(id.array = as.character(1:length(reuters_text_vector)),
                                  text.array = reuters_text_vector,
                                  stoplist.file = stopwords_en,
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
```
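To get a feel for what the `token.regexp` above keeps, the pattern can be tried on a single string in base R. This is only a rough sketch: it assumes the text is lowercased before tokenizing, it uses R's PCRE engine (`perl = TRUE`) rather than the Java regex engine that mallet runs, and the sample sentence and the `toy_` names are made up for illustration.

```{r}
# Same pattern as passed to mallet.import() above
toy_pattern <- "\\p{L}[\\p{L}\\p{P}]+\\p{L}"

# Made-up sample text, lowercased first (an assumption about the import step)
toy_text <- tolower("OPEC's crude-oil output rose 5 pct, it said.")

# Extract every non-overlapping match of the pattern
toy_tokens <- regmatches(toy_text, gregexpr(toy_pattern, toy_text, perl = TRUE))[[1]]
toy_tokens
```

A token has to begin and end with a letter and be at least three characters long, so "it" and "5" are dropped and trailing punctuation is trimmed, while internal punctuation as in "crude-oil" is kept.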
Create a topic trainer object.
```{r}
topic.model <- MalletLDA(num.topics=5, alpha.sum = 1, beta = 0.1)
```
Load our documents. We could also pass in the filename of a saved instance list file built with the command-line tools.
```{r}
topic.model$loadDocuments(mallet.instances)
```
Get the vocabulary, and some statistics about word frequencies. These may be useful in further curating the stopword list.
```{r}
vocabulary <- topic.model$getVocabulary()
head(vocabulary)
word.freqs <- mallet.word.freqs(topic.model)
head(word.freqs)
```
Optimize hyperparameters every 20 iterations, after 50 burn-in iterations.
```{r}
topic.model$setAlphaOptimization(20, 50)
```
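As a rough picture of that schedule, the iterations on which optimization would run in a 200-iteration training run can be listed in base R. This assumes that optimization fires on every iteration that is past the burn-in and is a multiple of the interval; that rule is an assumption made for this sketch, and the `toy_` names are hypothetical.

```{r}
toy_interval <- 20    # optimize every 20 iterations
toy_burn_in <- 50     # but only after 50 burn-in iterations
toy_iterations <- seq_len(200)

# Assumed rule: past burn-in AND a multiple of the interval
toy_opt_iters <- toy_iterations[toy_iterations > toy_burn_in &
                                toy_iterations %% toy_interval == 0]
toy_opt_iters
```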
Now train a model. Note that hyperparameter optimization is on by default. We can specify the number of iterations; here we'll use a large-ish round number.
```{r}
topic.model$train(200)
```
**NEW** Run through a few iterations where we pick the best topic for each token, rather than sampling from the posterior distribution.
```{r}
topic.model$maximize(10)
```
Get the probability of topics in documents and the probability of words in topics. By default, these functions return raw word counts. Here we want probabilities, so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
```{r}
doc.topics <- mallet.doc.topics(topic.model, smoothed=TRUE, normalized=TRUE)
topic.words <- mallet.topic.words(topic.model, smoothed=TRUE, normalized=TRUE)
```
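The effect of `smoothed=TRUE` and `normalized=TRUE` can be illustrated on a toy count matrix in base R. This is a sketch of the general idea (add a small constant to every count, then divide each row by its sum), not the package's own code; the counts and the constant `toy_beta` are invented for the example.

```{r}
# Made-up topic-by-word counts
toy_counts <- matrix(c(3, 0, 1,
                       0, 4, 0),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("topic1", "topic2"),
                                     c("oil", "price", "opec")))

toy_beta <- 0.1                                      # hypothetical smoothing constant
toy_smoothed <- toy_counts + toy_beta                # no cell is exactly zero any more
toy_probs <- toy_smoothed / rowSums(toy_smoothed)    # each row now sums to 1
toy_probs
rowSums(toy_probs)
```

Without the smoothing step, the zero counts would stay at exactly 0 probability after normalizing.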
What are the top words in topic 2? Notice that R indexes from 1 and Java from 0, so this will be the topic that mallet called topic 1.
```{r}
mallet.top.words(topic.model, word.weights = topic.words[2,], num.top.words = 5)
```
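Picking the top words amounts to sorting one row of the topic-word matrix. A minimal base R sketch follows, using a made-up vocabulary and weight vector (`toy_vocab`, `toy_weights`) and a hypothetical helper `toy_top_words()`; it is not how `mallet.top.words()` is implemented.

```{r}
toy_vocab <- c("oil", "price", "opec", "barrel", "market")
toy_weights <- c(0.40, 0.10, 0.25, 0.05, 0.20)

# Hypothetical helper: return the n highest-weighted terms
toy_top_words <- function(vocab, weights, n = 3) {
  ord <- order(weights, decreasing = TRUE)[seq_len(min(n, length(weights)))]
  data.frame(term = vocab[ord], weight = weights[ord], stringsAsFactors = FALSE)
}

toy_top <- toy_top_words(toy_vocab, toy_weights, n = 3)
toy_top

# Index offset between R (1-based) and Java (0-based)
r_topic <- 2
mallet_topic <- r_topic - 1
```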
Show the first document with more than 5% of its tokens belonging to topic 1.
```{r}
inspect(reuters[doc.topics[,1] > 0.05][1])
```
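The indexing expression above combines a logical filter with `[1]`. The same pattern is easier to see on a toy document-topic matrix; the values and the `toy_` names here are invented for illustration.

```{r}
# Rows are documents, columns are topics
toy_doc_topics <- matrix(c(0.01, 0.99,
                           0.60, 0.40,
                           0.03, 0.97,
                           0.20, 0.80),
                         ncol = 2, byrow = TRUE)

# Row numbers of documents with more than 5% of topic 1
toy_hits <- which(toy_doc_topics[, 1] > 0.05)
toy_hits

# The first such document
toy_first <- toy_hits[1]
toy_first
```

If no document passes the threshold, `toy_hits` is empty and `toy_hits[1]` is `NA`, so it can be worth checking `length(toy_hits)` first.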
How do topics differ across different sub-corpora?
```{r}
usa_articles <- unlist(meta(reuters, "places")) == "usa"
usa.topic.words <- mallet.subset.topic.words(topic.model,
                                             subset.docs = usa_articles,
                                             smoothed=TRUE,
                                             normalized=TRUE)
other.topic.words <- mallet.subset.topic.words(topic.model,
                                               subset.docs = !usa_articles,
                                               smoothed=TRUE,
                                               normalized=TRUE)
```
How do they compare?
```{r}
head(mallet.top.words(topic.model, usa.topic.words[1,]))
head(mallet.top.words(topic.model, other.topic.words[1,]))
```