Predictive Keyboard. Milestone report
=======================================
The project is, in essence, an exercise in building a predictive model for text input through a keyboard. The predictive model could be a combination of probabilistic models (N-grams, among others) and rule-based models (which, in general, could *also* be expressed with probabilities). Different models will be used for the different tasks of the keyboard. Since this document is just a milestone to check on progress, I'll reserve the more detailed discussion for the final report.
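Just to make the N-gram idea concrete, here is a minimal, purely illustrative sketch of how a bigram model would suggest the next word; the words and counts below are invented for illustration, not taken from the project data:

```{r}
## Toy bigram counts: for each current word, the words observed to follow it and
## how often they did. All values here are invented, for illustration only.
bigram_counts <- list(
  "i"    = c(am = 120, have = 95, think = 60),
  "have" = c(a = 150, to = 130, been = 80)
)

## suggest the most frequent continuation of the current word
predict_next <- function(word, counts = bigram_counts) {
  followers <- counts[[tolower(word)]]
  if (is.null(followers)) return(NA_character_)  # unseen word - would need a fallback
  names(which.max(followers))
}

predict_next("I")     # "am"
predict_next("have")  # "a"
```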
Deliverables for this milestone
-------------------------------
The main deliverables are:
- Demonstrate that you've downloaded the data and have successfully loaded it in.
- Create a basic report of summary statistics about the data sets.
- Report any interesting findings that you have amassed so far.
What data do we have?
---------------------
The data provided consists of four sets of files containing samples of tweets, blog posts and news articles, in English, German, Finnish and Russian. Some basic statistics (line and word counts, obtained by running the *wc* command over all the files) follow:
```
   lines     words  file
  371440  12653185  .//de_DE/de_DE.blogs.txt
  244743  13219388  .//de_DE/de_DE.news.txt
  947774  11803735  .//de_DE/de_DE.twitter.txt
  899288  37334690  .//en_US/en_US.blogs.txt
 1010242  34372720  .//en_US/en_US.news.txt
 2360148  30374206  .//en_US/en_US.twitter.txt
  439785  12732013  .//fi_FI/fi_FI.blogs.txt
  485758  10446725  .//fi_FI/fi_FI.news.txt
  285214   3153003  .//fi_FI/fi_FI.twitter.txt
  337100   9691167  .//ru_RU/ru_RU.blogs.txt
  196360   9416099  .//ru_RU/ru_RU.news.txt
  881414   9542485  .//ru_RU/ru_RU.twitter.txt
```
So it seems we have a large amount of data to analyze.
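For reference, roughly equivalent counts can be computed from R itself; this is only a sketch - it assumes the corpus sits under a local *data* folder, reads each file fully into memory, and splits words on whitespace, so the numbers will not match *wc* exactly:

```{r eval=FALSE}
## rough R equivalent of `wc -lw` over all corpus files (assumes the files
## live under ./data and fit in memory; word splitting is on whitespace)
files <- list.files("data", pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
counts <- t(sapply(files, function(f) {
  txt <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(lines = length(txt),
    words = sum(sapply(strsplit(txt, "\\s+"), length)))
}))
counts
```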
Exploratory analysis
-----------------------------
The task here is to create a basic report of summary statistics about the data sets. At this point we want a quick N-gram analysis with unigrams and bigrams, checking the frequencies of the most used words and expressions. I'll mostly use the **tm** (text mining) and **RWeka** libraries for this initial exploration, starting with a subset of the English blogs set. The R code should be run from the directory that contains the *data* folder (so that the relative path *data/en_US* resolves).
I'll use a small subset of the data for this initial exploratory task. (I hit several issues with RWeka, Weka and Java on the Mac - the error "Error in rep(seq_along(x), sapply(tflist, length)) : invalid 'times' argument" - which forced me to run on a single core with options(mc.cores=1); more details here: http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka.)
```{r cache=TRUE}
library(tm)
library(RWeka)

## create a unigram tokenizer (RWeka)
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
## create a bigram tokenizer (RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

## load the English documents
en_texts <- VCorpus(DirSource(directory="data/en_US/small", encoding="UTF-8"),
                    readerControl=list(language="en"))

## switch to lowercase first (so stopword removal also catches capitalized forms),
## then drop punctuation, stopwords and extra whitespace - DON'T STEM YET
en_texts <- tm_map(x=en_texts, FUN=content_transformer(tolower))
en_texts <- tm_map(x=en_texts, FUN=removePunctuation)
en_texts <- tm_map(x=en_texts, FUN=removeWords, words=stopwords(kind="en"))
en_texts <- tm_map(x=en_texts, FUN=stripWhitespace)

## create the term-document matrices
## NOTE - without the "options" line below, the TermDocumentMatrix call crashes
## (it looks like a parallel-processing issue)
options(mc.cores=1)
tdmUnigram <- TermDocumentMatrix(en_texts, control=list(tokenize=UnigramTokenizer))
tdmBigram <- TermDocumentMatrix(en_texts, control=list(tokenize=BigramTokenizer))
```
Some quick stats:
- how many distinct words (not stemmed at this point) do we have in the dictionary? **`r tdmUnigram$nrow`**
- how many different two-word expressions do we have? **`r tdmBigram$nrow`**
- which are the most frequent words?
```{r cache=TRUE}
findFreqTerms(tdmUnigram, 160)
```
- which are the most used two-word expressions?
```{r cache=TRUE}
findFreqTerms(tdmBigram, 40)
```
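Note that `findFreqTerms` only lists the terms above a given frequency threshold, without the actual counts. To rank terms explicitly, one option (fine for this small subset, but it densifies the matrix, so it would not scale to the full data) is to sum the rows of the term-document matrices and sort:

```{r cache=TRUE}
## total frequency of each term across the documents, highest first
## (as.matrix() creates a dense copy - acceptable only for small subsets)
unigramFreq <- sort(rowSums(as.matrix(tdmUnigram)), decreasing = TRUE)
bigramFreq  <- sort(rowSums(as.matrix(tdmBigram)),  decreasing = TRUE)
head(unigramFreq, 10)  # ten most frequent words, with counts
head(bigramFreq, 10)   # ten most frequent two-word expressions, with counts
```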
```{r cache=TRUE}
## now stem the documents and rebuild the term-document matrices
en_texts <- tm_map(x=en_texts, FUN=stemDocument)
tdmUnigramStemmed <- TermDocumentMatrix(en_texts, control=list(tokenize=UnigramTokenizer))
tdmBigramStemmed <- TermDocumentMatrix(en_texts, control=list(tokenize=BigramTokenizer))
```
Again, the same stats as above (this time the words are **stemmed**):
- how many distinct words do we have in the dictionary? **`r tdmUnigramStemmed$nrow`**
- how many different two-word expressions do we have? **`r tdmBigramStemmed$nrow`**
- which are the most frequent words?
```{r cache=TRUE}
findFreqTerms(tdmUnigramStemmed, 160)
```
- which are the most used two-word expressions?
```{r cache=TRUE}
findFreqTerms(tdmBigramStemmed, 40)
```
A quick interesting note on the fact that I started with a very small subset of the initial data - 10K blog entries. The stemmed vocabulary (the number of unigrams) is already fairly large, around 20K words, which is in the range of an average person's vocabulary size. It looks like we might not have to use the full data set when analyzing unigrams.
Further, for the bigrams, the difference between the stemmed and non-stemmed sets was almost negligible: the very-high-frequency bigrams seem to be made of very short, simple words, which look the same whether stemmed or not. This would encourage us to use stemming, since the resulting matrix would be a lot smaller.
Some distributions of frequencies, for Unigrams and Bigrams:
```{r cache=TRUE, echo=FALSE}
## distribution of the (log) unigram cell counts
unigrams <- as.matrix(tdmUnigram)
tblUnigrams <- table(unigrams)
hist(log(tblUnigrams), main="Histogram of Unigrams", breaks=50)
```
```{r cache=TRUE, echo=FALSE}
## distribution of the (log) bigram cell counts
bigrams <- as.matrix(tdmBigram)
tblBigrams <- table(bigrams)
hist(log(tblBigrams), main="Histogram of Bigrams", breaks=50)
```