intro_to_ggparty.Rmd

# Introduction to package 'ggparty'

Tianyao Han (th2830) and Wancheng Chen (wc2687)

```{r, include=FALSE}
knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE
)
```

```{r}
# import packages
library(dplyr)
library(partykit)
library(ggparty)
library(ggplot2)
library(party)
```

We have already known how to visualize the distribution information of a single numerical feature using graphs like histogram, boxplot, and violin plot. Using QQ plot to test if this feature has a nromal distribution. Showing relationships between nummerical features using ridgeline plots or scatter plots or with the help of a heatmap. When talking about categorical features, we have mosiac plots, which is quite useful, to deal with. 

The relationship between features can be somewhat not easy to figure out, like the “diagnal line” problem in the previous homework, the true linear relationship between humid and temp can not be easily interpreted until we find the hiden relation between temp and dew point. In other words, it is often the case that the overall linear relation between two variables are not so significant until we find a proper way to set a partition on our data set. That is, the property of other features plays a role of adjusting or even controling the relation between our interested features.

Under this circumstance, we can introduce package 'ggparty' to help us do this part. The ggparty package aims to extend ggplot2 functionality to the partykit package. It provides the necessary tools to create clearly structured and highly customizable visualizations for tree-objects of the class 'party'.

## Introdunction of class 'party' 
```{r}
# Use WeatherPlay dataset from partykit package
data("WeatherPlay", package = "partykit")
WeatherPlay
```

To represent the data as a tree-based object in partykit, two basic building blocks are used: splits of class ‘partysplit’ and nodes of class ‘partynode’. The resulting recursive partition can then be associated with a data set in an object of class ‘party’.

1. Use 'partysplit' function to define split methods for the column: \
* partysplit(varid, breaks = NULL, index = NULL, right = TRUE, prob = NULL, info = NULL) \
The 'varid' means the number of the column that will be split in the dataset (e.g., 1L for outlook, 3L for humidity, 4L for windy etc.). There are two ways to split the column data, one is to split by specfic value for continuous data and one is to split by type for categorial data. Some arbitrary information can be associated with a ‘partysplit’ object by passing it to the 'info' argument.

```{r}
sp_o <- partysplit(1L, index = 1:3) 
sp_h <- partysplit(3L, breaks = 75)
sp_w <- partysplit(4L, index = 1:2)
```

2. Use 'partynode' function to define the split methods for nodes in the tree: \
* partynode(id, split = NULL, kids = NULL, surrogates = NULL, info = NULL)  \
The 'split' means the partysplit method mentioned above.'id' is an integer identifier of the node number, 'split' is a ‘partysplit’ object, and 'kids' is a list of ‘partynode’ objects. Some arbitrary information can be supplied in 'info' argument.
```{r}
pn <- partynode(1L, split = sp_o, kids = list(
  partynode(2L, split = sp_h, kids = list(
    partynode(3L, info = "yes"),
    partynode(4L, info = "no"))),
  partynode(5L, info = "yes"),
  partynode(6L, split = sp_w, kids = list(
    partynode(7L, info = "yes"),
    partynode(8L, info = "no")))))
py <- party(pn, WeatherPlay)
```

In the end, the 'party' function will split the data and record the splitting process in the shape of a tree.
```{r}
print(py)
```

```{r}
plot(py)
```

## Use 'ggparty' to visualize the tree
Use ggparty function to visualize the object of class 'party' into a tree-based type.
* geom_edge() draws the edges between the nodes \
* geom_edge_label() labels the edges with the corresponding split breaks \
* geom_node_label() labels the nodes with the split variable, node info or anything else. The shorthand versions of this geom geom_node_splitvar() and geom_node_info() have the correct defaults to write the split variables in the inner nodes or the info in the terminal nodes \
* geom_node_plot() creates a custom ggplot at the location of the node \
```{r}
ggparty(py) +
  geom_edge() +
  geom_edge_label() +
  geom_node_label(aes(label = splitvar),
                  ids = "inner") +
  geom_node_label(aes(label = info),
                  ids = "terminal")
```

## Customize the tree
Change color and size of the nodes.
```{r}
ggparty(py) +
  geom_edge() +
  geom_edge_label() +
  # map color to level and size to nodesize for all nodes
  geom_node_splitvar(aes(col = factor(level),
                         size = nodesize)) +
  geom_node_info(aes(col = factor(level),
                     size = nodesize))
```

## Add plots to the tree
Generate a new 'party' object. Set 'fitted' and 'response' variables for the 'party' object. That is the two variables that we want to find the correlation.
```{r}
n1 <- partynode(id = 1L, split = sp_o, kids = lapply(2L:4L, partynode))
t2 <- party(n1,
            data = WeatherPlay,
            fitted = data.frame(
              "(fitted)" = fitted_node(n1, data = WeatherPlay),
              "(response)" = WeatherPlay$play,
              check.names = FALSE),
            terms = terms(play ~ ., data = WeatherPlay)
)
t2 <- as.constparty(t2)
```

New split type for the dataset. Adding fitted and response variables.
```{r}
print(t2)
```

To visualize the distribution of the variable play we will use the geom_node_plot() function. It allows us to show the data of each node in its separate plot. For this to work, we have to specify the argument gglist. Basically we have to provide a 'list' of all the 'gg' components we would add to a ggplot() call on the data element of a node. Because the response variable is categorial, use barchart to do the visualization part.
```{r}
ggparty(t2) +
  geom_edge() +
  geom_edge_label() +
  geom_node_splitvar() +
  geom_node_plot(gglist = list(geom_bar(aes(x = "", fill = play),
                                        position = position_fill()),
                               xlab("play")))
```

## Application

So how to dig this hiden information out, with the help of explortary data analysis? Our group will show you how to use recursive partition method and visualization techniques to find some interesting patterns that are difficult to find out without knowing this technique.

### Categorical vs Numerical

We first will show you how to automatically visualize impact of other variables on the relation between categirical and numerical variables. The dataset we will use is Pima Indians Diabetes Database.

```{r}
data("PimaIndiansDiabetes2", package = "mlbench")
head(PimaIndiansDiabetes2)
```


In this dataset, it is natrual to think that plasma glucose concentration glucose is an important predictor for diabetes. Suppose we want to explore the relation between glocose and diabetes. We first take the summary of the data. And result shows that we have 768 observations but variable tricpes and insulin have too many NAs so we decide not using these two variables. Then we delete NAs of other variables and get our dataframe PimaIndianDiabetes.

```{r}
str(PimaIndiansDiabetes2)
summary(PimaIndiansDiabetes2)
PimaIndiansDiabetes <- na.omit(PimaIndiansDiabetes2[,-c(4, 5)])
```

We first want to use ggplot2 to show the relation. Diabetes is a categorical variable while glucose is a numerical variable. So we use density curve and boxplot to show. From these two plots we can get information:

1. People with positive diabetes have higher glucose than people have negative diabetes.
2. More outliers(larger than normal) occur in glucose distribution under people with negative diabetes.
3. The glucose distribution under people with positive diabetes seem to have 2 peeks and is not normally distributed.

From these information we find that although generally people with positive diabetes have higher glocose, but how other varibles can affect this property(so that there are two peeks in glucose distribution of people with positive diabetes)? It is hard to tell from the plots we get so far. Maybe we can try to facet on other variables to see what happens. But when other variables are numerical, we have to first do discretization, and how wide the bin will be can just be another problem. So here is time for using Model-Based Recursive Partitioning method to help us find what the significant node is and how to visualize it.

```{r}
ggplot(PimaIndiansDiabetes, aes(glucose, fill = diabetes, colour = diabetes)) +
  geom_density(alpha = 0.1) + 
  xlim(0,250) + 
  ggtitle("glucose distribution under different diabetes type")
```

```{r}
ggplot(PimaIndiansDiabetes, aes(x=diabetes, y=glucose, fill=diabetes))+
  geom_boxplot()+
  ggtitle("glucose range under diffrent diabete type")
```

From these information we get we find that although generally people with positive diabetes have higher glocose, but how other varibles can affect this property(so that there are two peeks in glucose distribution of people with positive diabetes)? It is hard to tell from the plots we get so far. Maybe we can try to facet on other variables to see what happens. But when other variables are numerical, we have to first do discretization, and how wide the bin will be can just be another problem. So here is time for using Model-Based Recursive Partitioning method to help us find what the significant node is and visualize it. Here we will use mob function from party library. We can directly read the print result to get some interesting information. This tree model shows three patterns about relation of glucose and diabetes.

Node 2 Women with low body mass index that have on average a low risk of diabetes, however this increases clearly with glucose level. Node 4 Women with average and high body mass index, younger than 30 years, that have a higher avarage risk that also increases with glucose level. Node 5 Women with average and high body mass index, older than 30 years, that have a high avarage risk that increases only slowly with glucose level.

```{r}
#tree
fmPID <- mob(diabetes ~ glucose | pregnant + pressure + mass + pedigree + age,
data = PimaIndiansDiabetes, model = glinearModel, family = binomial())

#show information
print(fmPID)
```

Printed information are always bad for illustration, we can simply use plot function to get a better result. And the three patternd talked are quite obvious in the graph.

```{r}
# plot it 
plot(fmPID)
```

We can compute the odds ratio for each group. On average, the odds increase by 6.7%, 4.6% and 2.8% with respect to glucose in the three groups. There is way more information than just a single boxplot or distribution plot.

```{r}
# show odds ratios
exp(coef(fmPID)[,2])
```

### Numerical vs Numerical

PimaIndianDiabetes dataset shows the case to explore relation between one numerical variable and one categorical variable. When we have two numerical variables, we can also use this method to show more detailed patterns. This time we will use ggparty to make our plot more pretty. 

TeachingRatings is a dataset on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000–2002 at the University of Texas at Austin. 

```{r}
data("TeachingRatings", package = "AER")
head(TeachingRatings)
```


Now we want to know if there relation between evaluation scores and teachers’ physical appearance. We first use a scatter plot to show that. We can generally see that beauty and eval has a positive corelation. As teachers’ rating of physical appearence goes higher, the evaluation score also goes higher.

```{r}
ggplot(TeachingRatings, aes(x=beauty, y=eval)) + 
  geom_point() + 
  ggtitle("scatter plot on beauty and eval")
```

Faceted by minority and division, and lengended by gender, we see more detailed patterns betwenn beauty and eval.

```{r}
ggplot(data=TeachingRatings,aes(x=beauty, y=eval, color=gender))+
  geom_point(alpha=0.5, size=0.8)+
  geom_smooth(method=lm, se=FALSE)+
  facet_grid(minority ~ division, margins=TRUE)+
  labs(title="scatter plot on beauty and eval with gender as legend",
       subtitle = "faeted by minority and division")
```

If we want to see whether some categorical variables in the dataset also can affect the coreltion we found so far, traditionally we can use facet and legend by hand with ggplot2. But what variable to use is a big problem if there are many. Things can be more complicated when we want to see if a numerical variable has an effect on out found corelation. We have choose a proper bin to change the numericla variable to a factor. However, things can be much more easier if we use Model-Based Recursive Partitioning method. Here we will use lmtree method in partykit package and plot the result.

```{r fig.width=10, fig.height=8} 
data("TeachingRatings", package = "AER")

tr_tree <- lmtree(eval ~ beauty | minority + age + gender + division + native + tenure, data = TeachingRatings, weights = students, caseweights = FALSE)

plot(tr_tree)
```

To make our plot prettier, we use package from ggparty. An interesting pattern occurs in node 8. More beautiful a female instructor more than 40 years old teaching an upper class is, lower evaluation score she has.

```{r fig.width=12, fig.height=9} 
ggparty(tr_tree,
        terminal_space = 0.5,
        add_vars = list(p.value = "$node$info$p.value")) +
  geom_edge(size = 1.5) +
  geom_edge_label(colour = "grey", size = 6) +
  geom_node_plot(gglist = list(geom_point(aes(x = beauty,
                                             y = eval),
                                         alpha = 0.5),
                              theme_bw(base_size = 15)),
                scales = "fixed",
                id = "terminal",
                shared_axis_labels = T,
                shared_legend = T,
                legend_separator = T,
                predict = "beauty",
                predict_gpar = list(col = "blue",
                                   size = 1.2)
                ) +
  geom_node_label(aes(col = splitvar),
                  line_list = list(aes(label = paste("Node", id)),
                                   aes(label = splitvar),
                                   aes(label = paste("p =", formatC(p.value, format = "e", digits = 2)))),
                  line_gpar = list(list(size = 10, col = "black", fontface = "bold"),
                                   list(size = 18),
                                   list(size = 10)),
                  ids = "inner") +
  geom_node_label(aes(label = paste0("Node ", id, ", N = ", nodesize)),
                  fontface = "bold",
                  ids = "terminal",
                  size = 5, 
                  nudge_y = 0.01) +
  theme(legend.position = "none")

```

And we can take the summary of node 8 to further explore.

```{r}
summary(tr_tree, node = 8)
```