-
Notifications
You must be signed in to change notification settings - Fork 109
/
Copy pathintro_to_ggparty.Rmd
288 lines (227 loc) · 14.6 KB
/
intro_to_ggparty.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
# Introduction to package 'ggparty'
Tianyao Han (th2830) and Wancheng Chen (wc2687)
```{r, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
```
```{r}
# import packages
library(dplyr)
library(partykit)
library(ggparty)
library(ggplot2)
library(party)
```
We have already known how to visualize the distribution information of a single numerical feature using graphs like histogram, boxplot, and violin plot. Using QQ plot to test if this feature has a nromal distribution. Showing relationships between nummerical features using ridgeline plots or scatter plots or with the help of a heatmap. When talking about categorical features, we have mosiac plots, which is quite useful, to deal with.
The relationship between features can be somewhat not easy to figure out, like the “diagnal line” problem in the previous homework, the true linear relationship between humid and temp can not be easily interpreted until we find the hiden relation between temp and dew point. In other words, it is often the case that the overall linear relation between two variables are not so significant until we find a proper way to set a partition on our data set. That is, the property of other features plays a role of adjusting or even controling the relation between our interested features.
Under this circumstance, we can introduce package 'ggparty' to help us do this part. The ggparty package aims to extend ggplot2 functionality to the partykit package. It provides the necessary tools to create clearly structured and highly customizable visualizations for tree-objects of the class 'party'.
## Introdunction of class 'party'
```{r}
# Use WeatherPlay dataset from partykit package
data("WeatherPlay", package = "partykit")
WeatherPlay
```
To represent the data as a tree-based object in partykit, two basic building blocks are used: splits of class ‘partysplit’ and nodes of class ‘partynode’. The resulting recursive partition can then be associated with a data set in an object of class ‘party’.
1. Use 'partysplit' function to define split methods for the column: \
* partysplit(varid, breaks = NULL, index = NULL, right = TRUE, prob = NULL, info = NULL) \
The 'varid' means the number of the column that will be split in the dataset (e.g., 1L for outlook, 3L for humidity, 4L for windy etc.). There are two ways to split the column data, one is to split by specfic value for continuous data and one is to split by type for categorial data. Some arbitrary information can be associated with a ‘partysplit’ object by passing it to the 'info' argument.
```{r}
sp_o <- partysplit(1L, index = 1:3)
sp_h <- partysplit(3L, breaks = 75)
sp_w <- partysplit(4L, index = 1:2)
```
2. Use 'partynode' function to define the split methods for nodes in the tree: \
* partynode(id, split = NULL, kids = NULL, surrogates = NULL, info = NULL) \
The 'split' means the partysplit method mentioned above.'id' is an integer identifier of the node number, 'split' is a ‘partysplit’ object, and 'kids' is a list of ‘partynode’ objects. Some arbitrary information can be supplied in 'info' argument.
```{r}
pn <- partynode(1L, split = sp_o, kids = list(
partynode(2L, split = sp_h, kids = list(
partynode(3L, info = "yes"),
partynode(4L, info = "no"))),
partynode(5L, info = "yes"),
partynode(6L, split = sp_w, kids = list(
partynode(7L, info = "yes"),
partynode(8L, info = "no")))))
py <- party(pn, WeatherPlay)
```
In the end, the 'party' function will split the data and record the splitting process in the shape of a tree.
```{r}
print(py)
```
```{r}
plot(py)
```
## Use 'ggparty' to visualize the tree
Use ggparty function to visualize the object of class 'party' into a tree-based type.
* geom_edge() draws the edges between the nodes \
* geom_edge_label() labels the edges with the corresponding split breaks \
* geom_node_label() labels the nodes with the split variable, node info or anything else. The shorthand versions of this geom geom_node_splitvar() and geom_node_info() have the correct defaults to write the split variables in the inner nodes or the info in the terminal nodes \
* geom_node_plot() creates a custom ggplot at the location of the node \
```{r}
ggparty(py) +
geom_edge() +
geom_edge_label() +
geom_node_label(aes(label = splitvar),
ids = "inner") +
geom_node_label(aes(label = info),
ids = "terminal")
```
## Customize the tree
Change color and size of the nodes.
```{r}
ggparty(py) +
geom_edge() +
geom_edge_label() +
# map color to level and size to nodesize for all nodes
geom_node_splitvar(aes(col = factor(level),
size = nodesize)) +
geom_node_info(aes(col = factor(level),
size = nodesize))
```
## Add plots to the tree
Generate a new 'party' object. Set 'fitted' and 'response' variables for the 'party' object. That is the two variables that we want to find the correlation.
```{r}
n1 <- partynode(id = 1L, split = sp_o, kids = lapply(2L:4L, partynode))
t2 <- party(n1,
data = WeatherPlay,
fitted = data.frame(
"(fitted)" = fitted_node(n1, data = WeatherPlay),
"(response)" = WeatherPlay$play,
check.names = FALSE),
terms = terms(play ~ ., data = WeatherPlay)
)
t2 <- as.constparty(t2)
```
New split type for the dataset. Adding fitted and response variables.
```{r}
print(t2)
```
To visualize the distribution of the variable play we will use the geom_node_plot() function. It allows us to show the data of each node in its separate plot. For this to work, we have to specify the argument gglist. Basically we have to provide a 'list' of all the 'gg' components we would add to a ggplot() call on the data element of a node. Because the response variable is categorial, use barchart to do the visualization part.
```{r}
ggparty(t2) +
geom_edge() +
geom_edge_label() +
geom_node_splitvar() +
geom_node_plot(gglist = list(geom_bar(aes(x = "", fill = play),
position = position_fill()),
xlab("play")))
```
## Application
So how to dig this hiden information out, with the help of explortary data analysis? Our group will show you how to use recursive partition method and visualization techniques to find some interesting patterns that are difficult to find out without knowing this technique.
### Categorical vs Numerical
We first will show you how to automatically visualize impact of other variables on the relation between categirical and numerical variables. The dataset we will use is Pima Indians Diabetes Database.
```{r}
data("PimaIndiansDiabetes2", package = "mlbench")
head(PimaIndiansDiabetes2)
```
In this dataset, it is natrual to think that plasma glucose concentration glucose is an important predictor for diabetes. Suppose we want to explore the relation between glocose and diabetes. We first take the summary of the data. And result shows that we have 768 observations but variable tricpes and insulin have too many NAs so we decide not using these two variables. Then we delete NAs of other variables and get our dataframe PimaIndianDiabetes.
```{r}
str(PimaIndiansDiabetes2)
summary(PimaIndiansDiabetes2)
PimaIndiansDiabetes <- na.omit(PimaIndiansDiabetes2[,-c(4, 5)])
```
We first want to use ggplot2 to show the relation. Diabetes is a categorical variable while glucose is a numerical variable. So we use density curve and boxplot to show. From these two plots we can get information:
1. People with positive diabetes have higher glucose than people have negative diabetes.
2. More outliers(larger than normal) occur in glucose distribution under people with negative diabetes.
3. The glucose distribution under people with positive diabetes seem to have 2 peeks and is not normally distributed.
From these information we find that although generally people with positive diabetes have higher glocose, but how other varibles can affect this property(so that there are two peeks in glucose distribution of people with positive diabetes)? It is hard to tell from the plots we get so far. Maybe we can try to facet on other variables to see what happens. But when other variables are numerical, we have to first do discretization, and how wide the bin will be can just be another problem. So here is time for using Model-Based Recursive Partitioning method to help us find what the significant node is and how to visualize it.
```{r}
ggplot(PimaIndiansDiabetes, aes(glucose, fill = diabetes, colour = diabetes)) +
geom_density(alpha = 0.1) +
xlim(0,250) +
ggtitle("glucose distribution under different diabetes type")
```
```{r}
ggplot(PimaIndiansDiabetes, aes(x=diabetes, y=glucose, fill=diabetes))+
geom_boxplot()+
ggtitle("glucose range under diffrent diabete type")
```
From these information we get we find that although generally people with positive diabetes have higher glocose, but how other varibles can affect this property(so that there are two peeks in glucose distribution of people with positive diabetes)? It is hard to tell from the plots we get so far. Maybe we can try to facet on other variables to see what happens. But when other variables are numerical, we have to first do discretization, and how wide the bin will be can just be another problem. So here is time for using Model-Based Recursive Partitioning method to help us find what the significant node is and visualize it. Here we will use mob function from party library. We can directly read the print result to get some interesting information. This tree model shows three patterns about relation of glucose and diabetes.
Node 2 Women with low body mass index that have on average a low risk of diabetes, however this increases clearly with glucose level. Node 4 Women with average and high body mass index, younger than 30 years, that have a higher avarage risk that also increases with glucose level. Node 5 Women with average and high body mass index, older than 30 years, that have a high avarage risk that increases only slowly with glucose level.
```{r}
#tree
fmPID <- mob(diabetes ~ glucose | pregnant + pressure + mass + pedigree + age,
data = PimaIndiansDiabetes, model = glinearModel, family = binomial())
#show information
print(fmPID)
```
Printed information are always bad for illustration, we can simply use plot function to get a better result. And the three patternd talked are quite obvious in the graph.
```{r}
# plot it
plot(fmPID)
```
We can compute the odds ratio for each group. On average, the odds increase by 6.7%, 4.6% and 2.8% with respect to glucose in the three groups. There is way more information than just a single boxplot or distribution plot.
```{r}
# show odds ratios
exp(coef(fmPID)[,2])
```
### Numerical vs Numerical
PimaIndianDiabetes dataset shows the case to explore relation between one numerical variable and one categorical variable. When we have two numerical variables, we can also use this method to show more detailed patterns. This time we will use ggparty to make our plot more pretty.
TeachingRatings is a dataset on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000–2002 at the University of Texas at Austin.
```{r}
data("TeachingRatings", package = "AER")
head(TeachingRatings)
```
Now we want to know if there relation between evaluation scores and teachers’ physical appearance. We first use a scatter plot to show that. We can generally see that beauty and eval has a positive corelation. As teachers’ rating of physical appearence goes higher, the evaluation score also goes higher.
```{r}
ggplot(TeachingRatings, aes(x=beauty, y=eval)) +
geom_point() +
ggtitle("scatter plot on beauty and eval")
```
Faceted by minority and division, and lengended by gender, we see more detailed patterns betwenn beauty and eval.
```{r}
ggplot(data=TeachingRatings,aes(x=beauty, y=eval, color=gender))+
geom_point(alpha=0.5, size=0.8)+
geom_smooth(method=lm, se=FALSE)+
facet_grid(minority ~ division, margins=TRUE)+
labs(title="scatter plot on beauty and eval with gender as legend",
subtitle = "faeted by minority and division")
```
If we want to see whether some categorical variables in the dataset also can affect the coreltion we found so far, traditionally we can use facet and legend by hand with ggplot2. But what variable to use is a big problem if there are many. Things can be more complicated when we want to see if a numerical variable has an effect on out found corelation. We have choose a proper bin to change the numericla variable to a factor. However, things can be much more easier if we use Model-Based Recursive Partitioning method. Here we will use lmtree method in partykit package and plot the result.
```{r fig.width=10, fig.height=8}
data("TeachingRatings", package = "AER")
tr_tree <- lmtree(eval ~ beauty | minority + age + gender + division + native + tenure, data = TeachingRatings, weights = students, caseweights = FALSE)
plot(tr_tree)
```
To make our plot prettier, we use package from ggparty. An interesting pattern occurs in node 8. More beautiful a female instructor more than 40 years old teaching an upper class is, lower evaluation score she has.
```{r fig.width=12, fig.height=9}
ggparty(tr_tree,
terminal_space = 0.5,
add_vars = list(p.value = "$node$info$p.value")) +
geom_edge(size = 1.5) +
geom_edge_label(colour = "grey", size = 6) +
geom_node_plot(gglist = list(geom_point(aes(x = beauty,
y = eval),
alpha = 0.5),
theme_bw(base_size = 15)),
scales = "fixed",
id = "terminal",
shared_axis_labels = T,
shared_legend = T,
legend_separator = T,
predict = "beauty",
predict_gpar = list(col = "blue",
size = 1.2)
) +
geom_node_label(aes(col = splitvar),
line_list = list(aes(label = paste("Node", id)),
aes(label = splitvar),
aes(label = paste("p =", formatC(p.value, format = "e", digits = 2)))),
line_gpar = list(list(size = 10, col = "black", fontface = "bold"),
list(size = 18),
list(size = 10)),
ids = "inner") +
geom_node_label(aes(label = paste0("Node ", id, ", N = ", nodesize)),
fontface = "bold",
ids = "terminal",
size = 5,
nudge_y = 0.01) +
theme(legend.position = "none")
```
And we can take the summary of node 8 to further explore.
```{r}
summary(tr_tree, node = 8)
```