-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcondensed.R
448 lines (343 loc) · 19 KB
/
condensed.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
# condensed.R --------------------------------------------------------
# This document is a pared-down version of the part1.R and part2.R files,
# intended for a shorter workshop. There is no new content, just a light
# reorganization to focus less on basic data structures and more on getting up
# and running using packages and conducting analyses, intended to be delivered
# as a single 90-minute workshop.
# Condensed Agenda --------------------------------------------------------
# R and RStudio, why we code
# Creating projects, where files live
# Variables, vectors, indexing vectors
# Loading files -- absolute vs relative filepaths
# Installing and loading packages -- dplyr, sjPlot
# Inspecting data: head, tail, str, View
# Functions
# dplyr verbs: filter, select, mutate, group_by, arrange
# Pipes
# Statistical tests: t-test, ANOVA, post-hocs, linear regression, logistic
# If time, graphing with ggplot2
# Why Should We Code? -----------------------------------------------------
# We code to accomplish tasks that would be less efficient or impossible to do
# by hand. There are many different coding languages. Which one you use depends
# on the task at hand and the conventions of a given field. A lot of programming
# languages are flexible and *can* do a lot, but have different things that they
# excel at. I like to think of coding languages as being like utensils: they
# help us accomplish a task, and there are a lot of options, but some utensils
# are better at some jobs (e.g., spoon vs fork for eating soup).
# We're going to focus on R because that's a common option in psychological
# research. Why do people like using R/what makes it well-suited to
# psychological research? (1) It's open-source and free, meaning users like you
# and me can add additional functionality on top of base R, and it's free for
# anyone to use. (2) It has an active teaching community, so a lot of great
# resources for learning are freely available online. (3) Particularly thanks to
# Posit/RStudio, it's easy to set up projects and keep files organized. (4) It's
# excellent at statistical analyses and graphing.
# Some other options you may come across are SPSS and JASP, which are "point and
# click." These are powerful tools and can be totally fine for conducting
# analyses! They're just a little less flexible because you're limited to their
# existing functionality. They can also be less reproducible, with people
# clicking around and not saving their work. There are ways to save your work,
# which you should do if you're using them.
# Setting Up Projects -----------------------------------------------------
# It's important to keep files clean and organized when you're working on a
# project. I like to create folders dedicated to a given project, and then keep
# all files and data related to that project in a given folder. Let's create a
# folder for these workshops. Go to File -> New Project... -> New Directory ->
# New Project. Create a directory (AKA a folder) and choose where to put it, and
# click Create Project. This is where your files for these workshops will live!
# Let's make a file called day1.R (ctrl/cmd-shift-N or File -> New File -> R
# Script) for today's activities.
# Important things in RStudio to start: we have the console (runs code, doesn't
# save it), environment (shows all the objects you've created, right now we have
# none), and then the files/plots/help window. When you create a file, you also
# add a script editor.
# Variables, Vectors, and Indexing Vectors --------------------------------
# R is a flexible language that can perform a lot of tasks, from basic
# arithmetic up through advanced statistics analyses and graphing. It is often
# the case that we want to save the results of some work, either to use later on
# or just to save us from having to re-run things repeatedly. We can save
# results to a variable using the assignment operator, <- . I don't want to get
# into the nitty-gritty, but <- is convention when you're assigning values to
# variables, and = is typically used for function parameter definitions. Most
# important is that you are consistent.
# example 1: creating a variable that contains the result of addition
my_addition <- 35 + 24
print(my_addition)
# example 2: creating a variable that stores a string (text)
favourite_fruit <- "peach"
print(favourite_fruit)
# Let's take a moment here to talk about some coding best practices: commenting
# and spacing.
# I'm using comments in two ways right now: inline (next to code),
# and to create sections. Inline, I'm commenting with what I'm doing. This is
# not just to make it easier for others to read, it's for myself! You know what
# you're doing when you write your code, but when you come back days, weeks,
# months, years later, you won't always remember, so it's useful to have a quick
# description of what is happening and why. As an additional benefit, it can
# help people reading your code for the first time as well. With creating
# sections, I'm organizing my file so it's easy to navigate with a legend. These
# lines with a preceding # are comments, so won't be run when I run my code. If
# you follow up a comment with at least four hyphens (# ----), you create a new
# section. I have a ton of hyphens following my comment because it makes it
# easier to pick out sections when I scroll through the document, and it's the
# default of the automatic section creation function in R, which is ctrl-shift-R
# on a Mac.
# Notice that I'm spacing out everything. I have a space around the assignment
# operator <-, spaces between the addition/subtraction/multiplication/division
# sign and the numbers. This is because it's easier to read. For example, if I
# write:
2+2-1*8/3
# It's harder to read than:
2 + 2 - 1 * 8 / 3
# This is less noticeable with smaller lines like this, but when you have a
# bunch of text all at once, it's very noticeable when it's hard to read.
# R can store more than single values at once! It can also store multiple
# elements together, for example the grades in a class or groceries:
grades <- c(88, 89, 85, 91)
print(grades)
groceries <- c("cereal", "milk", "bananas", "dragonfruit", "chicken", "basil")
print(groceries)
# You can index the elements of a vector to see what is stored. In R, indexing
# starts at 1 (unlike, e.g., Python where the first position is indexed to 0).
# So to access the value "bananas" from the groceries vector:
groceries[3]
# A vector can only be one data type (integers, doubles i.e., decimal numbers,
# strings). If you try to put multiple types into one list, the data will be
# coerced into a common type. For example, grocery items and how many of each to
# get:
groceries_numbered <- c("cereal", 1, "milk", 1, "bananas", 7, "dragonfruit", 2, "chicken", 1, "basil", 20)
print(groceries_numbered) # all entries coerced to strings
# Lists (not vectors) can contain multiple types:
groceries_list <- list("cereal", 1, "milk", 1, "bananas", 7)
groceries_list
# Installing and Loading Packages -----------------------------------------
# You can use existing functions in base R or create your own functions, and you
# can *also* use functions written by other people. Recall that R is
# open-source, meaning anyone can write functions that can be freely used by
# other people. It's like Wikipedia in that anyone can write something, but
# there are editors who check that edits make sense. When people publish their
# functions to be used by others, those functions are contained in *packages*.
# To use a package and the functions therein, you need to (1) install the
# package, (2) attach the package to your R session (what we have open right now
# is an R session).
# You only need to install the package once (or when it is updated), but you
# need to attach it to every session you want to use it in. It's like any other
# software: you need to download Google Chrome once to have it on your computer
# (and sometimes you need to redownload to update it), but every time you want
# to use it you have to open it.
# Note that I usually put ALL packages I'll be using at the very top of the
# document, along with the data I'll be using.
# to install, need to do this once
install.packages("dplyr") # you need the quotation marks
install.packages("sjPlot")
# to attach to your session, need to do this every session
library(dplyr)
library("dplyr") # you can use quotation marks or not when attaching
library(sjPlot)
# dplyr stands for "data pliers," and is used for manipulating data. Let's look
# at some example datasets included in the dplyr package, starting with mtcars.
# This has information about cars' fuel consumption. We will use sjPlot later,
# when we are presenting the results of our statistical analyses.
# Inspecting Data ---------------------------------------------------------
# It is usually a good idea to look over your data when you read it in. Reasons
# include (1) familiarizing yourself with its structure, (2) briefly checking it
# read in correctly.
# The mtcars dataset is included in dplyr, which we've loaded, so it's already
# available to us. A few useful functions for looking at your data in R:
head(mtcars) # look at the first few rows
tail(mtcars) # look at the last few rows
summary(mtcars) # get an overview of values
str(mtcars) # overview of data types
View(mtcars) # view the entire dataset
# Data Manipulation with dplyr --------------------------------------------
# Unlike in Excel, you can't access the data directly when Viewing it. Why?
# Because it's better to change it using your code in a reproducible way (no
# wondering if values were changed, forgetting what was changed or moved, etc.),
# and that's where dplyr comes in! It's an R package used for data manipulation,
# with a few main functions: filter (choosing certain rows), select (choosing
# certain columns), mutate (adding/changing columns), arrange (order of rows),
# summarize (summarizing, as the name suggests).
# Filter
# Our dataset has information about cars' number of cylinders. Generally, more
# cylinders = more power. We can use equality/relational statements to filter
# the dataset to cars that have more than/less than/equal to a certain number of
# cylinders.
filter(mtcars, cyl == 6) # equal to, notice the two ==
filter(mtcars, cyl < 6) # less than 6
filter(mtcars, cyl > 6) # greater than 6
filter(mtcars, cyl <= 6) # less than or equal to
# Take cyl == 6. We can see that there are 7 rows. But if we call mtcars, it's
# still the full dataset.
mtcars
# That's because we didn't save it to a new dataset. If we want to work with a
# new dataset, we would need to save it to a variable.
mtcars_6cyl <- filter(mtcars, cyl == 6)
# Select
# So filter chooses certain rows. Select chooses certain columns! Let's say we
# just wanted to look at the car's miles per gallon (mpg) and cylinders (cyl).
# either way works
select(mtcars, mpg, cyl)
select(mtcars, c(mpg, cyl))
# again, if we wanted a new dataset, we would have to save it
mtcars_mpg_cyl <- select(mtcars, c(mpg, cyl))
# You can also *deselect* certain columns using a minus sign.
select(mtcars, -mpg) # removes mpg
select(mtcars, -c(mpg, cyl)) # removes mpg and cyl
# Mutate
# Mutate adds or changes column values. For example, we have the variable qsec
# which is the time it takes a car to go a quarter of a mile, in seconds. Maybe
# we want to know how quickly it can go one mile. Let's multiply by four.
mutate(mtcars, milesec = 4*qsec)
# Group By
# grouping is a powerful operation for data manipulation. Sometimes you want
# information about an entire group, for example maybe we want the average
# horsepower across cylinders. To do that, we first group the dataset by number
# of cylinders, then use the mutate function to calculate average horsepower for
# each group, and then ungroup and go on with our day.
ungroup(mutate(group_by(mtcars, cyl), hp_average = mean(hp)))
# You might notice that it's getting a little difficult to read! This is where
# the "pipe" operator comes in. In older versions of R, it looks like this %>%
# and you have to attach the packages dplyr or magrittr to your session to use
# it. In R 4.1 and newer, there is a pipe built into base R, which looks like
## |>. The pipe makes it easier to read your code, because it takes the result of
# what you just ran, and passes it on to the next thing you're going to run. Our
# same operation looks like this using the pipe:
mtcars |> # take the dataset and pass it forward
group_by(cyl) |> # group the dataset by cyl and pass it forward
mutate(hp_average = mean(hp)) |> # calculate the mean horsepower
ungroup() # ungroup the dataset and print the result
# again, if you want to save the result of your calculation, you need to save it
# to a variable
mtcars_hp <- mtcars |>
group_by(cyl) |>
mutate(hp_average = mean(hp)) |>
ungroup()
# Arrange
# Arrange organizes the rows in ascending or descending order. The default is
# ascending, i.e., lowest values at the top. If we want to see the lowest
# horsepower cars first:
arrange(mtcars, hp)
mtcars |> arrange(hp) # with the pipe
# highest horsepower, use desc() for descending
arrange(mtcars, desc(hp))
mtcars |> arrange(desc(hp))
# Summarize/Summarise (either spelling works)
# Summarize creates a new data frame summarizing some data. If the data are
# ungrouped, it returns one row. If the data are grouped, it returns one row per
# group. See some functions in the documentation: ?summarize.
mtcars |>
summarize(mean = mean(hp),
n = n())
# There are a bunch of other functions, as well, like renaming columns, removing
# them, doing rowwise operations, etc. https://dplyr.tidyverse.org/
# Statistical Analyses ----------------------------------------------------
# Let's do some common statistical analyses using a second dataset, starwars,
# which is included with dplyr and has information about 14 Star Wars characters
# like their height, mass, gender, species, etc. I saved it as a csv so we can
# practice loading files from outside of R.
# Option 1: relative filepath. RECOMMENDED.
starwars <- read.csv("starwars.csv")
# Option 2: absolute filepath. NOT RECOMMENDED. Your code will not work when
# someone else loads it because it will be referencing a specific filepath on
# *your* computer.
# starwars <- read.csv("/Users/yourname/R_tutorial/starwars.csv")
# first few rows
head(starwars)
# We'll focus on a few analyses that you might encounter or that you've seen
# before: t-tests, ANOVAs, linear and logistic regression. Then we'll look at
# displaying results nicely with sjPlot and graphing your regressions with
# marginaleffects.
# t-test
# Let's use a t-test to assess whether humans' weights are significantly
# different from other species.
unique(starwars$species) # we need to add a variable for human/non-human
starwars <- starwars |>
mutate(human = ifelse(species == "Human", "human", "nonhuman")) |>
mutate(human_droid_other = case_when(species == "Human" ~ "human",
species == "Droid" ~ "droid",
.default = "other"))
# In general, R formulas are "Y ~ X", which you can read as something like "Y
# predicted by X". So here, "mass 'predicted by' human status," which is to say
# "how does mass differ between the human and non-human species?"
t.test(mass ~ human, data = starwars)
# Linear regression
# you can use lm, also in the base stats package
# How are height and mass related?
# simple linear regression
model1 <- lm(mass ~ height, data = starwars)
summary(model1)
# We can add other predictors for multiple linear regression
model2 <- lm(mass ~ height + gender, data = starwars)
summary(model2)
# We can add other predictors for multiple linear regression
model3 <- lm(mass ~ height + gender, data = starwars)
summary(model3)
# anova
anova <- aov(mass ~ human_droid_other, data = starwars)
summary(anova)
# post-hoc tests (our result isn't significant, but just to illustrate how to do it)
TukeyHSD(anova)
# also note that a linear regression with a categorical predictor is just an anova, simpler to use one framework
anova_lm <- lm(mass ~ human_droid_other, data = starwars)
summary(anova_lm)
# # We won't discuss assumptions, but you can check with this package
# (https://easystats.github.io/performance/articles/check_model.html) or with
# any number of other R packages.
# Presenting Results: sjPlot and ggplot2 ----------------------------------
install.packages(c("sjPlot", "ggplot2")) # you have to do this once
library(sjPlot) # have to attach to session every time
library(ggplot2)
# You can print your model results in nice tables and visualize your regression
# coefficients using the sjPlot package. In particular, the function tab_model()
# prints a table ("tab") for your model, and plot_model() generates a plot.
# For our simple linear regression, model
tab_model(model1)
plot_model(model1)
# For our multiple linear regression, model2
tab_model(model2)
plot_model(model2)
# For our simple linear regression, categorical predictor, anova_lm
tab_model(anova_lm)
plot_model(anova_lm)
# For our anova
tab_model(anova)
# If time, graphing with ggplot2 ------------------------------------------
# ggplot stands for "grammar of graphics" plotting, which is the idea that just
# like languages have grammars and once you know that you can construct totally
# new sentences, graphs also have a "grammar," fundamental elements that you can
# use to build all kinds of graphs.
# The main ingredients in a graph are (1) the plotting region (this would be
# like a piece of paper, in the physical world) and (2) the shapes you're using
# in your graph (this is your type of graph) In ggplot2, the shapes are
# generally called "geom"s, and some examples of geoms are histograms,
# scatterplots, bar graphs, line graphs, boxplots, violin plots.
ggplot(starwars) + # this says "I'm making a plot", like getting out a piece of paper
geom_point(mapping = aes(x = height, y = mass)) # this is the "point" geom, for a scatterplot. mapping = aes(...) describes what's on your axes
# you can use mapping = aes() in the ggplot() function, in which case it applies
# to ALL geoms unless overwritten, or in an individual geom (just applies to
# that geom).
# There's a lot of graphing, we're not going to come close to doing everything
# possible. There are some resources in the tutorial notes document for further
# learning. The main things we're going to touch on are a few basic shapes and
# colours.
# Graph of height by gender.
ggplot(starwars, mapping = aes(x = gender, y = height)) +
geom_col()
# Some geoms distinguish between colour (the outside edge) and fill (the inside)
ggplot(starwars, mapping = aes(x = gender, colour = gender)) +
geom_bar()
ggplot(starwars, mapping = aes(x = gender, fill = gender)) +
geom_bar()
# some don't distinguish
ggplot(starwars) +
geom_point(mapping = aes(x = height, y = mass, colour = gender))
# you can also do data manipulation and then pipe into a graph. NOTE THE
# DIFFERENCE: ggplot uses plus signs (+) to join things together, not pipes
# (%>% or |>).
starwars |>
ggplot() +
geom_point(mapping = aes(x = height, y = mass, colour = gender))
starwars |>
filter(!is.na(gender)) |>
ggplot() +
geom_point(mapping = aes(x = height, y = mass, colour = gender))