tweets_tutorial.Rmd

---
title: "tweets_tutorial"
author: "Prashant Garg"
date: "2023-04-26"
output: html_document
---

################################################
#
#         2023 ML Workshop
#
#  Predicting, evaluating, and interpreting
#
#
################################################

install libraries (only once)
```{r}
# Un-comment and run these once, if you haven't installed them before
# install.packages("tidyverse") # useful everything
# install.packages("ggrepel") # nice labels for plots
# install.packages("glmnet") # supervised machine learning
# install.packages("quanteda") # very useful text stuff
# install.packages("textclean") # useful text stuff
# install.packages("stm") # topic models
# install.packages("textdata") # dictionaries
# install.packages("syuzhet") # sentiment
# install.packages("doc2concrete") # ngram model
# install.packages("politeness") # dialogue acts
# install.packages("spacyr") # grammar parsing
```

load libraries
```{r}
# Run these every time
library(tidyverse)
library(ggrepel)
library(glmnet)
library(quanteda)
library(textclean)
library(stm) 
library(textdata)
library(syuzhet)
library(doc2concrete)
library(politeness)
library(spacyr)
```

Here I am loading separate R scripts that contain functions we will use.
Make sure they are all in the main folder of your RStudio project, along with this script!
```{r}
source("create_dfm.R") # our dfm function as a separate script
source("kendall_acc.R") # our accuracy function
source("vectorFunctions.R") # for word embeddings
```

load and explore data
```{r}
# tweets data
tweets_data <- readRDS("tweets_data.RDS")

# First thing - check variables
names(tweets_data)
```

Calculate a 1-gram feature count matrix for the tweet data, with no dropped words
```{r}
dfm1<-create_dfm(tweets_data$text,
                ngrams=1,
                min.prop=0,
                stop.words = FALSE)

dim(dfm1) 
```
More than 42k ngrams! Too many
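
To make the object concrete, here is a minimal base-R sketch of a 1-gram count matrix built from three made-up tweets. It is not how `create_dfm()` works internally (that function lives in `create_dfm.R`); the names `toy_tweets`, `toy_tokens`, `toy_vocab` and `toy_dfm` are illustrative assumptions only.

```{r}
# Three made-up tweets (illustrative only, not from tweets_data)
toy_tweets <- c("clean energy now",
                "dirty coal now",
                "clean clean air")

# Split each tweet on spaces to get its 1-grams
toy_tokens <- strsplit(toy_tweets, " ", fixed = TRUE)

# The vocabulary is every distinct word, sorted alphabetically
toy_vocab <- sort(unique(unlist(toy_tokens)))

# Count each vocabulary word in each tweet:
# one row per tweet, one column per word
toy_dfm <- t(sapply(toy_tokens,
                    function(tok) table(factor(tok, levels = toy_vocab))))

toy_dfm
dim(toy_dfm)        # rows = tweets, columns = distinct words
colMeans(toy_dfm)   # average uses per tweet for each word
```

Every new distinct word adds a column, which is why a real collection of tweets produces such a wide matrix.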

most common words - obvious
```{r}
sort(colMeans(dfm1),decreasing=TRUE)[1:20]
```

least common words
```{r}
sort(colMeans(dfm1))[1:20]
```

Let's put this on a plot
```{r}
ngram_counts<-data.frame(word=colnames(dfm1),
                         frequency=colMeans(dfm1),
                         f_rank=rank(-colMeans(dfm1),
                                     ties.method="first"))

ngram_counts %>%
  ggplot(aes(x=f_rank,y=frequency)) +
  geom_point()
```

Let's write the words on the plot
```{r}
ngram_counts %>%
  ggplot(aes(x=f_rank,y=frequency,label=word)) +
  geom_text()
```

That's a mess... let's create a new column and pick only a few words to plot
```{r}
ngram_counts <- ngram_counts %>%
  mutate(word_label=ifelse(f_rank%in%c(1:10, # top 10 words
                                       sample(11:200,10), # ten random words from ranks 11-200
                                       sample(201:1000,20), # twenty random words from ranks 201-1000
                                       sample(1001:n(),20)), # twenty random words ranked above 1000
                           word,""))
```
Plotting only those labels is better, but still messy.

```{r}
ngram_counts %>%
  ggplot(aes(x=f_rank,y=frequency,label=word_label)) +
  geom_text()
```

Let's get rid of the rarest words and plot again
```{r}
ngram_counts %>%
  filter(f_rank<1000) %>%
  ggplot(aes(x=f_rank,y=frequency,label=word_label)) +
  geom_text()
```
There is still some overlap, so we'll use the ggrepel package to space out the words.

We'll also add the points for all the words, just to see

```{r}
ngram_counts %>%
  filter(f_rank<1000) %>%
  ggplot(aes(x=f_rank,y=frequency,label=word_label)) +
  geom_point(color="red") +
  geom_label_repel(max.overlaps=100,size=7,
                   nudge_x = 1,nudge_y = 1,
                   force=3) + 
  labs(x="Frequency Rank",y="Frequency per Tweet") +
  theme_bw() +
  theme(axis.title = element_text(size=24),
        axis.text = element_text(size=24))
```

(optional) save the plot
```{r}
# ggsave("word_freq.png")
```

# prediction

OK, let's build a model to predict whether a tweet is clean or dirty!

First, let's look at our classification data
```{r}
table(tweets_data$label)
```

Let's only use 1-grams for now
```{r}
dfm3<-create_dfm(tweets_data$text,ngrams=1) %>%
  convert(to="data.frame") %>%
  select(-doc_id)

# Lots of words
dim(dfm3)
```

Most common words in clean tweets (label 1)...
```{r}
sort(colMeans(dfm3[tweets_data$label==1,]),decreasing=T)[1:20]
```

...and in dirty tweets (label 0): lots of the same words!
```{r}
sort(colMeans(dfm3[tweets_data$label==0,]),decreasing=T)[1:20]
```

What we really care about is - does the presence of a word predict the label?
```{r}
# A simple start - correlate each word with label
correlations<-dfm3 %>%
  summarise_all(~round(cor(.,tweets_data$label),3)) %>%
  unlist()
```
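
As a hedged illustration of what each of those correlations measures, the sketch below uses one made-up word count and a made-up label (`toy_word` and `toy_label` are hypothetical names, not columns of the real data). With a 0/1 label, the correlation is positive when the word is used more in label-1 tweets and negative when it is used more in label-0 tweets.

```{r}
# Made-up data: six tweets, a 0/1 label, and counts of one word per tweet
toy_label <- c(1, 1, 1, 0, 0, 0)
toy_word  <- c(2, 1, 1, 0, 0, 1)   # used more often in label-1 tweets

# Mean use of the word within each label group
mean_1 <- mean(toy_word[toy_label == 1])
mean_0 <- mean(toy_word[toy_label == 0])

# Pearson correlation between the count and the label
toy_cor <- cor(toy_word, toy_label)

c(mean_label_1 = mean_1, mean_label_0 = mean_0, correlation = toy_cor)

# Flipping the label flips the sign of the correlation
cor(toy_word, 1 - toy_label)
```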

Ten lowest associations (dirty tweets)
```{r}
sort(correlations)[1:10]
```

Ten highest associations (clean tweets)
```{r}
rev(sort(correlations))[1:10]
```

note - same as:
```{r}
sort(correlations,decreasing=TRUE)[1:10]
```

Let's put this on a plot
```{r}
cor_set<-data.frame(correlation=correlations,
                    frequency=colMeans(dfm3),
                    word=colnames(dfm3)) %>%
  # let's group the points so we can add colour
  mutate(colour=case_when(
    correlation>.02  ~ "high", # positive correlations (> +.02)
    correlation<(-.02) ~ "low", # negative correlations (< -.02)
    TRUE ~ "none"))             # otherwise (i.e. if it is close to zero)
head(cor_set)
```

Let's make a quick plot
```{r}
cor_set %>%
  ggplot(aes(x=correlation,y=frequency,label=word,color=colour)) +
  geom_point() +
  geom_label_repel()+  
  theme_bw()
```

A few problems with this plot:

- The repel function is removing too many words
- We probably don't want to label the neutral words
- The points are all compressed at the bottom
- The colours are chosen automatically
- Axis labels are too small and don't make much sense
- We don't really need a legend here
- It would be nice to have a dark line indicating zero 
- The other grid lines are not that useful

Here's a fixed version
```{r}
cor_set %>%
  mutate(word=ifelse(abs(correlation)>.02,word,NA)) %>% # get rid of labels for all neutral words
  ggplot(aes(x=correlation,y=frequency,label=word,color=colour)) +
  scale_color_manual(breaks=c("low","none","high"),
                     values=c("brown4","gray","green3"))+  # manually assign colors to each category
  geom_vline(xintercept=0)+ # Add dark line at zero
  geom_point() +
  geom_label_repel(max.overlaps=16,force=3)+ # tolerate more overlapping words and force them apart
  scale_y_continuous(trans="log2",  # log-transform the Y axis so it's not compressed at the bottom
                     breaks=c(.01,.05,.1,.2,.5,1,2,5))+ # manually set axis ticks
  theme_bw() +
  labs(x="Correlation with label",y="Uses per tweet")+ # Write in proper axis titles
  theme(legend.position = "none",         # get rid of legend
        panel.grid = element_blank(),     # get rid of grid lines
        axis.title=element_text(size=20), # increase size of axis titles
        axis.text=element_text(size=16))  # increase size of axis text
```

##################################################

As we will discuss, we are not interested in the effects of individual words.
Instead, we care more about how all the words perform as a class.

To do this, we will use the cv.glmnet() function to build a model

First, we need to split the data into training and testing samples
```{r}
set.seed(2023) # arbitrary seed, so the random split is the same on every run
train_split<-sample(1:nrow(tweets_data),round(nrow(tweets_data)/2))

length(train_split)
```
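
The split is just a vector of row numbers: rows in it go to training, and everything else goes to testing via negative indexing. Here is a small sketch of that idea on ten made-up rows; `toy_n`, `toy_train`, `toy_test` and `toy_rows` are illustrative names, and the seed is arbitrary.

```{r}
set.seed(1)  # arbitrary seed, only so this toy example repeats exactly

toy_n <- 10
toy_train <- sample(1:toy_n, round(toy_n / 2))  # row numbers for training
toy_test  <- setdiff(1:toy_n, toy_train)        # every remaining row

# Negative indexing selects the same held-out rows as setdiff()
toy_rows <- data.frame(id = 1:toy_n)
identical(toy_rows$id[-toy_train], toy_test)

# The two halves do not overlap and together cover every row
length(intersect(toy_train, toy_test)) == 0
setequal(c(toy_train, toy_test), 1:toy_n)
```

Because the same row numbers are used for the features and for the labels, the rows of the X and Y objects stay aligned.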

create our prediction variables
```{r}
dfm3<-create_dfm(tweets_data$text,ngrams=1) %>%
  convert(to="data.frame") %>%
  select(-doc_id)


trainX<-dfm3 %>%
  slice(train_split) %>%
  as.matrix() # we convert to matrix format for glmnet

trainY<-tweets_data %>%
  slice(train_split) %>%
  pull(label)

testX<-dfm3 %>% 
  slice(-train_split) %>%
  as.matrix() # we convert to matrix format for glmnet

testY<-tweets_data %>%
  slice(-train_split) %>%
  pull(label)
```

Put training data into LASSO model (note - glmnet requires a matrix).
With its default settings, cv.glmnet() treats the 0/1 label as a numeric outcome.
```{r}
lasso_mod<-cv.glmnet(x=trainX,y=trainY)
```

generate predictions for test data
```{r}
test_predict<-predict(lasso_mod,newx = testX)[,1]
```

Note that while the true answers are binary, the predictions are continuous
Always check these distributions!!
```{r}
# true values
hist(testY)
```

```{r}
# predicted values
hist(test_predict)
```

For now, let's just split the predictions in two, using the median
```{r}
test_predict_binary=ifelse(test_predict>median(test_predict),
                           1,0)
```

This should now take the same two values (0 and 1) as testY
```{r}
hist(test_predict_binary)
```

and we can calculate accuracy from that
```{r}
round(100*mean(test_predict_binary==testY),3)
```
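
One thing to keep in mind about the median split: by construction it calls about half of the test cases 1, whatever the true share of 1s is. The sketch below shows this on made-up numbers (`toy_y` and `toy_pred` are hypothetical), and compares it with a fixed cut-off at .5, which is only one possible alternative and not necessarily the right choice for the real data.

```{r}
# Made-up labels and continuous predictions (illustrative only)
toy_y    <- c(1, 1, 1, 1, 1, 1, 0, 0)              # 75% of labels are 1
toy_pred <- c(.95, .9, .85, .8, .75, .6, .3, .2)   # every 1 is scored above every 0

# Median split: half of the cases end up called 1
toy_median <- ifelse(toy_pred > median(toy_pred), 1, 0)

# Fixed cut-off at .5, one possible alternative
toy_half <- ifelse(toy_pred > .5, 1, 0)

mean(toy_median)            # share predicted as 1 under the median split
mean(toy_median == toy_y)   # accuracy under the median split
mean(toy_half == toy_y)     # accuracy under the .5 cut-off
```

Checking `table(testY)` against the share of predicted 1s is a quick way to see whether the median split is reasonable for your data.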

#######################################################
Let's get a little better at interpreting the model



instead of splitting the dfm we can split the data itself
```{r}
tweets_data_train<-tweets_data[train_split,]
tweets_data_test<-tweets_data[-train_split,]

tweets_data_dfm_train<-create_dfm(tweets_data_train$text,ngrams=1)

tweets_data_dfm_test<-create_dfm(tweets_data_test$text,ngrams=1) %>%
  dfm_match(colnames(tweets_data_dfm_train))
```
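
The `dfm_match()` step matters because the model expects the test matrix to have the training features, in the same order. As a rough base-R sketch of the idea (not quanteda's implementation), the hypothetical helper `match_columns()` below pads words that are missing from the new data with zeros and drops words the training data never saw. All names and counts are made up.

```{r}
# Made-up count matrices with different vocabularies (illustrative only)
toy_train_counts <- matrix(c(1, 0, 2,
                             0, 1, 1),
                           nrow = 2, byrow = TRUE,
                           dimnames = list(NULL, c("air", "clean", "coal")))
toy_test_counts  <- matrix(c(1, 3,
                             0, 1),
                           nrow = 2, byrow = TRUE,
                           dimnames = list(NULL, c("clean", "smog")))

# Hypothetical helper: give `new` exactly the columns named in `features`
match_columns <- function(new, features) {
  out <- matrix(0, nrow = nrow(new), ncol = length(features),
                dimnames = list(NULL, features))
  shared <- intersect(features, colnames(new))
  out[, shared] <- new[, shared]  # copy counts for words seen in training
  out  # training-only words stay 0; words only in `new` are dropped
}

toy_test_matched <- match_columns(toy_test_counts, colnames(toy_train_counts))
toy_test_matched
```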


```{r}
tweet_model <-glmnet::cv.glmnet(x=tweets_data_dfm_train %>%
                               as.matrix(),
                             y=tweets_data_train$label)

plot(tweet_model)
```

Interpret with a coefficient plot.
This plots coefficients from the model, rather than raw correlations:
many fewer features, and more focused on what the model is actually doing.
```{r}
plotDat<-tweet_model %>%
  coef() %>%
  drop() %>%
  as.data.frame() %>%
  rownames_to_column(var = "ngram") %>%
  rename(score=".") %>%
  filter(score!=0 & ngram!="(Intercept)" & !is.na(score))  %>%
  # add ngram frequencies for plotting
  left_join(data.frame(ngram=colnames(tweets_data_dfm_train),
                       freq=colMeans(tweets_data_dfm_train)))

plotDat %>%
  mutate_at(vars(score,freq),~round(.,3))

plotDat %>%
  ggplot(aes(x=score,y=freq,label=ngram,color=score)) +
  scale_color_gradient(low="blue",high="red")+
  geom_vline(xintercept=0)+
  geom_point() +
  geom_label_repel(max.overlaps = 30,force = 6)+  
  scale_y_continuous(trans="log2",
                     breaks=c(.01,.05,.1,.2,.5,1,2,5))+
  theme_bw() +
  labs(x="Coefficient in Model",y="Uses per Tweet")+
  theme(legend.position = "none",
        axis.title=element_text(size=20),
        axis.text=element_text(size=16))
```

# Evaluate Accuracy

We are going to use a function that calculates a non-parametric (Kendall) correlation.
This function reports percentage accuracy, where 50% is random guessing.
It is calculated across all possible pairs of observations.
This is the same function that is in your "kendall_acc.R" script.
```{r}
kendall_acc<-function(x,y,percentage=TRUE){
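  kt=cor(x,y,method="kendall")  # Kendall rank correlation, between -1 and 1
  kt.acc=.5+kt/2                # rescale to 0-1, so .5 means random guessing
  kt.se=sqrt((kt.acc*(1-kt.acc))/length(x)) # approximate standard error of a proportion
  report=data.frame(acc=kt.acc,
                    lower=kt.acc-1.96*kt.se, # approximate 95% interval
                    upper=kt.acc+1.96*kt.se)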
  report = round(report,4)
  if(percentage) report = report*100
  return(report)
}
```
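
To see where the "all possible pairs" reading comes from, here is a small sketch with made-up numbers and no tied values (`toy_truth` and `toy_guess` are hypothetical). It counts the share of pairs that both variables put in the same order, and compares that with the `.5 + tau/2` rescaling used in the function.

```{r}
# Made-up outcome and prediction with no tied values (illustrative only)
toy_truth <- c(1, 2, 3, 4, 5)
toy_guess <- c(2, 1, 3, 5, 4)

# Enumerate every pair of observations (one pair per column)
toy_pairs <- combn(length(toy_truth), 2)

# A pair is concordant when both variables order the two observations the same way
toy_concordant <- sign(toy_truth[toy_pairs[1, ]] - toy_truth[toy_pairs[2, ]]) ==
  sign(toy_guess[toy_pairs[1, ]] - toy_guess[toy_pairs[2, ]])

toy_pair_acc <- mean(toy_concordant)   # share of correctly ordered pairs
toy_tau_acc  <- .5 + cor(toy_truth, toy_guess, method = "kendall") / 2

c(pair_counting = toy_pair_acc, from_tau = toy_tau_acc)
```

With a 0/1 label, many pairs are tied on the label. As far as we understand, `cor()` then applies a correction for ties, so the pair-counting reading holds only approximately in that case.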


```{r}
test_ngram_predict<-predict(tweet_model,
                            newx = tweets_data_dfm_test %>%
                              as.matrix())[,1]

acc_ngram<-kendall_acc(tweets_data_test$label,test_ngram_predict)

acc_ngram
```

Find examples
```{r}
# store predictions in data, calculate error and bias
tweets_data_test<-tweets_data_test %>%
  mutate(prediction=test_ngram_predict,
         error=abs(label-prediction),
         bias=label-prediction)

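# clean tweets (label 1) that the model predicted within .5 of the label
close_green<-tweets_data_test %>%
  filter(label==1 & error<.5) %>% 
  select(text,label,prediction)

# dirty tweets (label 0) that the model predicted within .5 of the label
close_dirty<-tweets_data_test %>%
  filter(label==0 & error<.5) %>%
  select(text,label,prediction)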
```


```{r}
# close_green
close_green %>%
  slice(1:2) %>%
  pull(text)
```


```{r}
# close_dirty
close_dirty %>%
  slice(1:2) %>%
  pull(text)
```


Error analysis - find biggest misses
```{r}
hist(tweets_data_test$prediction)
```


```{r}
hist(tweets_data_test$label)
```


```{r}
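# most negative bias first: prediction far above the label
# (typically dirty tweets that the model scored as clean)
miss_green<-tweets_data_test %>%
  arrange(bias) %>%
  slice(1:10) %>%
  select(text,label,prediction)

# most positive bias first: prediction far below the label
# (typically clean tweets that the model scored as dirty)
miss_dirty<-tweets_data_test %>%
  arrange(-bias) %>%
  slice(1:10) %>%
  select(text,label,prediction)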
```
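
The direction of the sort is easy to mix up, so here is a tiny sketch with made-up numbers (`toy_scored` is a hypothetical data frame). It uses the same definition of bias as above, label minus prediction, and base-R `order()` in place of `arrange()`.

```{r}
# Made-up labels and predictions (illustrative only)
toy_scored <- data.frame(text       = c("a", "b", "c", "d"),
                         label      = c(0, 0, 1, 1),
                         prediction = c(.9, .1, .8, .2))

# Same definition of bias as above: label minus prediction
toy_scored$bias <- toy_scored$label - toy_scored$prediction

# Most negative bias first: label 0 but a high prediction
toy_low <- toy_scored[order(toy_scored$bias), ][1, ]
toy_low

# Most positive bias first: label 1 but a low prediction
toy_high <- toy_scored[order(-toy_scored$bias), ][1, ]
toy_high
```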


```{r}
miss_dirty%>%
  slice(1:2) %>%
  pull(text)
```


```{r}
miss_green%>%
  slice(1:2) %>%
  pull(text)
```