-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path4_networkInference.Rmd
317 lines (248 loc) · 12.3 KB
/
4_networkInference.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
# Network inference: Running regressions with network data {#nwinference}
## Why run a regression?
```{r include=FALSE}
knitr::opts_chunk$set(fig.path = 'mainimages/')
```
Regressions are powerful tools that help you figure out which variable is correlated with another variable. There are two main reasons to use regressions:
1. you can find patterns of associations between variables that you cannot spot with the nacked eye (mainly because data is always messy and there is never a perfect correlation between two variables that you can spot alone by looking the data)
2. you can include control-variables: i.e., variables that you know from theory/past research are associated with your dependent variable and you should therefore control for
## Which model should I use?
Since we are working with network data, the regression models you can use are a little bit more complicated that ordinary least squares regressions or generalized linear models.
Basically there are two main model-groups that can be used to achieve a differnt goal.
1. Group 1: Which factors explain the variance in a specific variable? (whilst controlling for network dependencies)
2. Group 2: Explaining the data generating process of a network (how does a network come to be? is it significantly different from a random network with equalsized nodes and edges?)
We'll focus on the first group, where we use standard regression tools to explain factors that correlate with a depenent variable (e.g., player performance) and control for network effects.
For the secound group: check out this wonderful summary of advances in inferential network analysis: @cranmer2017navigating
## Basic idea of Temporal Network Autocorrelation Model (TNAM)
```{r, echo=FALSE, warning=FALSE, message=FALSE}
library(statnet)
library(ggplot2)
library(GGally)
library(scales)
library(xergm)
library(texreg)
```
The network autocorrelation model can be used to run regressions on data that show an inherent dependency among observations. If you do not control for these dependencies, the standard errors are calculated wrongly and your results will not hold!
First, let's prepare the data:
```{r}
#read data
el <- read.table('data_literatur_varia/Edgelist_highSchoolfriends.csv',
sep = ";", header = TRUE)
dt <- read.table('data_literatur_varia/Students_highSchoolAttributes.csv',
sep = ";", header = TRUE)
```
The data comes from a Swiss high-school that I've surveyed in 2014. About 600 students were asked to complete a questionnaire on their friends, their grades, their values, beliefs, deviant behavior etc.
```{r, echo=FALSE, fig.cap = 'Friendship network at a Swiss high-school', fig.width=10, fig.height=8}
knitr::include_graphics(rep("data_literatur_varia/NW_year_marks_respondentonly.pdf"))
```
We'll create two networks: one for all sorts of friends (best friends, drinking buddies etc.) and one for school friends only:
```{r}
#create adjacency matrix for all sorts of friends (best friends, drinking buddies, school friends etc.)
students <- unique(c(el$studentID, el$friend.ID.code))
mat <- matrix(0, nrow =length(students) , ncol = length(students))
colnames(mat) <- as.character(students)
rownames(mat) <- as.character(students)
mat[cbind(as.character(el$studentID),
as.character(el$friend.ID.code))] <- 1
```
```{r}
#create adjacency matrix for school friends
matsf <- matrix(0, nrow =length(students) , ncol = length(students))
colnames(matsf) <- as.character(students)
rownames(matsf) <- as.character(students)
matsf[cbind(as.character(el$studentID[el$school.friend == 'Yes']),
as.character(el$friend.ID.code[el$school.friend == 'Yes']))] <- 1
```
Next we need to create an attribute file and add variables:
```{r}
#create attributes file that matches the adjacency matrix
att <- data.frame('studentID' = rownames(mat))
#add attributes
# Grades in Math and French-Class
att$grade.math.french <- dt$marks.french.math.num[match(att$studentID, dt$studentID)]
# Name of school class
att$class <- dt$class[match(att$studentID, dt$studentID)]
# Gender
att$gender <- dt$gender[match(att$studentID, dt$studentID)]
# Age
att$age <- dt$age[match(att$studentID, dt$studentID)]
# Migration background
att$migrationbg <- dt$migration.bg[match(att$studentID, dt$studentID)]
# Liking school
att$liking.school <- dt$likes.school.num[match(att$studentID, dt$studentID)]
# Leisure time
att$sport.hours <- dt$sport.hours[match(att$studentID, dt$studentID)]
att$tv.hours <- dt$hr.spent.TV.num[match(att$studentID, dt$studentID)]
att$friends.hours <- dt$hr.spent.friends.num[match(att$studentID, dt$studentID)]
att$reading.hours <- dt$hr.spent.reading.num[match(att$studentID, dt$studentID)]
# Values: Being successful & Earning a lot
att$agree.besuccessfull <- dt$agree.be.successful.num[match(att$studentID, dt$studentID)]
att$agree.earnalot <- dt$aggree.earn.alot.num[match(att$studentID, dt$studentID)]
# Values: Repecting elders & Parental monitoring
att$true.parentsinterested <- dt$true.parents.interested.num[match(att$studentID, dt$studentID)]
att$true.respectelderly <- dt$agree.respect.elderly.num[match(att$studentID, dt$studentID)]
```
And then we're done preparing.
## TNAM-terms
To capture network dependencies in the data, several different `tnam`-terms can be used to calculate new variables that may affect the dependent variable (or more neutrally: is correlated with the dependent variable).
1. Spatial Lags - where a standard variable is multiplied with an adjacency
matrix to test whether network autocorrelation exists, meaning whether the
behavior of my neighbors/friends affects my behavior (is correlated with
my behavior - no matter which way the correlation points, i.e., whether I
choose similar friends or whether my friends affect me or whether I
influence my friends).
2. Attribute similarity - using a spatial lag term, you compare the
dependent variable of your friends to your own value in the dependent variable
to see whether homophily (either influence or selection or both) is at work.
3. Network effects - you can test the effect of network
effects on the dependent variable, i.e., whether popularity affects a dependent variable or centrality scores correlate with the dependent variable.
Find out more on tnam-terms by typing:
```{r}
?'tnam-terms'
```
## Running TNAMs
### Purely exogenous factors - neglecting important network dependencies!
```{r}
fit1 <- lm(grade.math.french ~
age
+ gender,
data = att)
summary(fit1)
```
Now these results are not reliable since we do not control for network dependencies among observations (=students). SE are overestimated, p-values are too low and estimates are also biased if we do not control for the fact that these observations potentially influence each other.
### Checking for network autocorrelation
Let's add a netlag term for network dependency. This term measures whether
your own performance (= measured by the grades achieved in math and French class) correlates with the performance of your immediate friends.
With the `netlag()`-function we can easily estimate the (average) performance of an ego's friends.
```{r}
nldt <- netlag(att$grade.math.french, mat)
```
The netlag()-function creates a new data set with the following columns:
1. netlag.pathdist1 = performance of best friends
2. time = here a constant
3. node = nodeID
4. response = dependent variable = here performance
We will adopt the first variable into our attributes data set and include it in the regression:
```{r}
att$grade.friends <- netlag(att$grade.math.french, mat)[,1]
```
```{r}
fit2 <- lm(grade.math.french ~
age
+ gender
+ grade.friends,
data = att)
summary(fit2)
```
The normal `netlag()`-function uses no normalization and simply adds all the values of the friends together.
I prefer to use an average effect, where the performance is averaged over the number of friends I have.
```{r}
att$avg.grade.ofriends <- netlag(att$grade.math.french, mat, normalization = 'row')[,1]
att$avg.grade.ifriends <- netlag(att$grade.math.french, mat, normalization = 'column')[,1]
```
```{r}
fit3 <- lm(grade.math.french ~
age
+ gender
+ avg.grade.ofriends,
data = att)
summary(fit3)
```
The `netlag()`-term allows for more distand network effects:
```{r}
att$avg.grades.ofriendoffriends <- netlag(att$grade.math.french, mat, normalization = 'row', pathdist = 2)[,1]
```
```{r}
fit4 <- lm(grade.math.french ~
age
+ gender
+ avg.grade.ofriends
+ avg.grades.ofriendoffriends,
data = att)
summary(fit4)
```
### Checking for Clique effects
Instead of checking for correlations between ego's behavior and the average behavior of ego's friends you can also check for clique effects: here, the average performance is calculated not only for direct friends but also for indirect friends that are part of ego's clique.
```{r}
att$clique.grades <- cliquelag(att$grade.math.french, mat)[,1]
fit4 <- lm(grade.math.french ~
age
+ gender
+ clique.grades,
data = att)
summary(fit4)
```
### Checking for centrality effects
Next we can check if the network position of a student affects their performance. We'll test indegree-, outdegree- and betweenness-centrality.
```{r}
att$indegree <- degree(mat, cmode = 'indegree')
fit5a <- lm(grade.math.french ~
age
+ gender
+ avg.grade.ofriends
+ indegree,
data = att)
summary(fit5a)
att$outdegree <- degree(mat, cmode = 'outdegree')
fit5b <- lm(grade.math.french ~
age
+ gender
+ avg.grade.ofriends
+ outdegree,
data = att)
summary(fit5b)
att$betweenness <- betweenness(mat)
fit5c <- lm(grade.math.french ~
age
+ gender
+ avg.grade.ofriends
+ betweenness,
data = att)
summary(fit5c)
```
None of them seem to correlate with students' performance.
### Attribute similarity
You can also check whether students who share some attribute also share performance scores.
We could test this using the 'sport.hours' variable that measures the hours a student spends doing sports/exercises.
```{r, echo=FALSE, eval=FALSE}
source("data_literatur_varia/fun-error.R")
```
```{r, eval=FALSE, eval=FALSE}
att$attrsim.sporthr <- attribsim(att$grade.math.french, att$sport.hours, match = FALSE)[,1]
fit6 <- lm(grade.math.french ~
age
+ gender
+ avg.grade.ofriends
+ attrsim.sporthr,
data = att)
```
Sadly the `tnam`-package has an bug in this function for now. It will be solved soon (I already reported the bug).
### A fuller model
Controlling for gender and age is probably an underspecification, which leads to biased estimates (i.e., coefficients and standard errors). For sake of teaching a sparse model was chosen, but:
```{block, type = 'rmdright'}
always make sure you include all (potential) control variables! Otherwise your results will be biased.
```
You can check your regression by:
- comparing BIC scores between models (lower BIC = better model)
- comparing R^2 between models
- checking prediction performance of your model (advanced topic)
If you neglect this, you will interpret a model that does not explain your data well. Your results will be biased - meaning that the coefficients may be wrong in size or your standard errors too small (or too large). If you are unsure whether or not you should include another variable do this:
Run two models: one with the variable and one without. Then check if the fit is better (lower BIC score) and if your effects (i.e., coefficients) change drastically.
Check out this lovely YouTube-Video on variable selection in multiple regression models: https://www.youtube.com/watch?v=HP3RhjLhRjY.
```{r}
fit7 <- lm(grade.math.french ~ age
+ gender
+ liking.school #yes/no-dummy
+ tv.hours #hours of tv/week
+ reading.hours #hours of reading/week
+ agree.earnalot #want to earn alot? scale: no = 1, yes = 4
+ true.parentsinterested #are your parents interested in your life? scale: no = 1, yes = 4
+ avg.grade.ofriends,
data = att)
summary(fit7)
```
If you want to compare models, use the `texreg`-package:
```{r, results='asis'}
htmlreg(list(fit1, fit4, fit7),
stars = c(.1, 0.05, 0.01, 0.001))
```