Predictive Keyboard. Milestone report
=======================================
The project is, in essence, an exercise in building a predictive model for text input through a keyboard. The predictive model could be a combination of probabilistic models (N-grams, among others) and rule-based models (which, in general, could *also* be expressed with probabilities). Different models will be used for the different tasks of the keyboard. Since this document is just a milestone to check on progress, I'll reserve the more detailed discussion for the final report.
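Just to make the N-gram idea concrete, here is a minimal, purely illustrative sketch of how a bigram model would suggest the next word; the words and counts below are invented for illustration, not taken from the project data:

```{r}
## Toy bigram counts: for each current word, the words observed to follow it and
## how often they did. All values here are invented, for illustration only.
bigram_counts <- list(
  "i"    = c(am = 120, have = 95, think = 60),
  "have" = c(a = 150, to = 130, been = 80)
)

## suggest the most frequent continuation of the current word
predict_next <- function(word, counts = bigram_counts) {
  followers <- counts[[tolower(word)]]
  if (is.null(followers)) return(NA_character_)  # unseen word - would need a fallback
  names(which.max(followers))
}

predict_next("I")     # "am"
predict_next("have")  # "a"
```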
Deliverables for this milestone
-------------------------------
The main deliverables are:
- Demonstrate that you've downloaded the data and have successfully loaded it in.
- Create a basic report of summary statistics about the data sets.
- Report any interesting findings that you have amassed so far.
What data do we have?
---------------------
The data provided consists of four sets of files containing samples of tweets, blog posts and news articles, in English, German, Finnish and Russian. Some basic statistics (line and word counts, obtained by running the *wc* command over all the files) follow:
```
   lines     words  file
  371440  12653185  .//de_DE/de_DE.blogs.txt
  244743  13219388  .//de_DE/de_DE.news.txt
  947774  11803735  .//de_DE/de_DE.twitter.txt
  899288  37334690  .//en_US/en_US.blogs.txt
 1010242  34372720  .//en_US/en_US.news.txt
 2360148  30374206  .//en_US/en_US.twitter.txt
  439785  12732013  .//fi_FI/fi_FI.blogs.txt
  485758  10446725  .//fi_FI/fi_FI.news.txt
  285214   3153003  .//fi_FI/fi_FI.twitter.txt
  337100   9691167  .//ru_RU/ru_RU.blogs.txt
  196360   9416099  .//ru_RU/ru_RU.news.txt
  881414   9542485  .//ru_RU/ru_RU.twitter.txt
```
So it seems we have a large amount of data to analyze.
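For reference, roughly equivalent counts can be computed from R itself; this is only a sketch - it assumes the corpus sits under a local *data* folder, reads each file fully into memory, and splits words on whitespace, so the numbers will not match *wc* exactly:

```{r eval=FALSE}
## rough R equivalent of `wc -lw` over all corpus files (assumes the files
## live under ./data and fit in memory; word splitting is on whitespace)
files <- list.files("data", pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
counts <- t(sapply(files, function(f) {
  txt <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(lines = length(txt),
    words = sum(sapply(strsplit(txt, "\\s+"), length)))
}))
counts
```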
Exploratory analysis
-----------------------------
The task here is to create a basic report of summary statistics about the data sets. At this point we want a quick N-gram analysis with unigrams and bigrams, checking the frequencies of the most used words and expressions. I'll mostly use the **tm** (text mining) and **RWeka** libraries for this initial exploration, starting with a subset of the English blogs set. The R code should be run from the directory that contains the *data* folder (so that the relative path *data/en_US* resolves).
I'll use a small subset of the data for this initial exploratory task. (I hit several issues with RWeka, Weka and Java on the Mac - the error "Error in rep(seq_along(x), sapply(tflist, length)) : invalid 'times' argument" - which forced me to run on a single core with options(mc.cores=1); more details here: http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka.)
```{r cache=TRUE}
library(tm)
library(RWeka)

## create a unigram tokenizer (RWeka)
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
## create a bigram tokenizer (RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

## load the English documents
en_texts <- VCorpus(DirSource(directory="data/en_US/small", encoding="UTF-8"),
                    readerControl=list(language="en"))

## switch to lowercase first (so stopword removal also catches capitalized forms),
## then drop punctuation, stopwords and extra whitespace - DON'T STEM YET
en_texts <- tm_map(x=en_texts, FUN=content_transformer(tolower))
en_texts <- tm_map(x=en_texts, FUN=removePunctuation)
en_texts <- tm_map(x=en_texts, FUN=removeWords, words=stopwords(kind="en"))
en_texts <- tm_map(x=en_texts, FUN=stripWhitespace)

## create the term-document matrices
## NOTE - without the "options" line below, the TermDocumentMatrix call crashes
## (it looks like a parallel-processing issue)
options(mc.cores=1)
tdmUnigram <- TermDocumentMatrix(en_texts, control=list(tokenize=UnigramTokenizer))
tdmBigram <- TermDocumentMatrix(en_texts, control=list(tokenize=BigramTokenizer))
```
Some quick stats:
- how many distinct words (not stemmed at this point) do we have in the dictionary? **`r tdmUnigram$nrow`**
- how many different two-word expressions do we have? **`r tdmBigram$nrow`**
- which are the most frequent words?
```{r cache=TRUE}
findFreqTerms(tdmUnigram, 160)
```
- which are the most used two-word expressions?
```{r cache=TRUE}
findFreqTerms(tdmBigram, 40)
```
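Note that `findFreqTerms` only lists the terms above a given frequency threshold, without the actual counts. To rank terms explicitly, one option (fine for this small subset, but it densifies the matrix, so it would not scale to the full data) is to sum the rows of the term-document matrices and sort:

```{r cache=TRUE}
## total frequency of each term across the documents, highest first
## (as.matrix() creates a dense copy - acceptable only for small subsets)
unigramFreq <- sort(rowSums(as.matrix(tdmUnigram)), decreasing = TRUE)
bigramFreq  <- sort(rowSums(as.matrix(tdmBigram)),  decreasing = TRUE)
head(unigramFreq, 10)  # ten most frequent words, with counts
head(bigramFreq, 10)   # ten most frequent two-word expressions, with counts
```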
```{r cache=TRUE}
## now stem the documents and rebuild the term-document matrices
en_texts <- tm_map(x=en_texts, FUN=stemDocument)
tdmUnigramStemmed <- TermDocumentMatrix(en_texts, control=list(tokenize=UnigramTokenizer))
tdmBigramStemmed <- TermDocumentMatrix(en_texts, control=list(tokenize=BigramTokenizer))
```
Again, the same stats as above (this time the words are **stemmed**):
- how many distinct words do we have in the dictionary? **`r tdmUnigramStemmed$nrow`**
- how many different two-word expressions do we have? **`r tdmBigramStemmed$nrow`**
- which are the most frequent words?
```{r cache=TRUE}
findFreqTerms(tdmUnigramStemmed, 160)
```
- which are the most used two-word expressions?
```{r cache=TRUE}
findFreqTerms(tdmBigramStemmed, 40)
```
A quick interesting note on the fact that I started with a very small subset of the initial data - 10K blog entries. The stemmed vocabulary (the number of unigrams) is already fairly large, around 20K words, which is in the range of an average person's vocabulary size. It looks like we might not have to use the full data set when analyzing unigrams.
Further, for the bigrams, the difference between the stemmed and non-stemmed sets was almost negligible: the very-high-frequency bigrams seem to be made of very short, simple words, which look the same whether stemmed or not. This would encourage us to use stemming, since the resulting matrix would be a lot smaller.
Some distributions of frequencies, for Unigrams and Bigrams:
```{r cache=TRUE, echo=FALSE}
## distribution of the (log) unigram cell counts
unigrams <- as.matrix(tdmUnigram)
tblUnigrams <- table(unigrams)
hist(log(tblUnigrams), main="Histogram of Unigrams", breaks=50)
```
```{r cache=TRUE, echo=FALSE}
## distribution of the (log) bigram cell counts
bigrams <- as.matrix(tdmBigram)
tblBigrams <- table(bigrams)
hist(log(tblBigrams), main="Histogram of Bigrams", breaks=50)
```