---
output:
  pdf_document: default
  html_document: default
---
```{r include=FALSE}
library(tidyverse)
```
# Introduction to R
We're assuming you're either new to R or need a refresher.
We'll start with some basic R operations entered directly in the console in RStudio.
## Variables
Variables are objects that store values. As in math, every programming language stores
values by assigning constants or the results of expressions to variables.
`x <- 5` uses the R standard assignment operator `<-` though you can also use `=`.
We'll use `<-` because it is more common and avoids some confusion with other syntax.
Variable names must start with a letter, have no spaces, and cannot be reserved words
like `for`. You should also avoid names that are already used by R or by package
libraries, such as function names like `log()`, since reusing them causes confusion.
```{r}
x <- 5
y <- 8
longitude <- -122.4
latitude <- 37.8
my_name <- "Inigo Montoya"
```
To check the value of a variable or other object, you can just enter the name in
the console, or even in the code in a code chunk.
```{r}
x
y
longitude
latitude
my_name
```
This is counter to the way printing out values works in programming, and you will
need to know how that works as well, because you will want to use your code
to develop tools that accomplish things, and there are also limitations to what you
can see by just naming variables.
To see the values of variables in programming mode, use the `print()` function,
or to concatenate character string output, use `paste()`:
```{r}
print(x)
print(y)
print(latitude)
paste("The location is latitude", latitude, "longitude", longitude)
paste("My name is", my_name, "-- Prepare to die.")
```
## Functions
Once you have variables or other objects to work with, most of your work
involves *functions* such as the well-known math functions:
```
log10(100)
log(exp(5))
cos(pi)
sin(90 * pi/180)
```
There are too many functions to name,
even in base R, not to mention all the packages we will want to use.
You will likely have already used the `install.packages()` and `library()` functions
that add in an array of other functions.
Later we'll also learn how to write our own functions, a capability that is easy to
accomplish and also gives you a sense of what developing your own package might be like.
**Arithmetic operators**
There are of course all the normal arithmetic operators (that are actually functions)
like `+`, `-`, `*`, and `/`. You're probably familiar with these from using equations in Excel if not
in some other programming language you may have learned. These operators look a bit different
from how they'd look when creating a nicely formatted equation. $\frac{NIR - R}{NIR + R}$
instead has to look like `(NIR-R)/(NIR+R)`. Similarly `*` *must* be used to multiply; there's no implied multiplication
that we expect in a math equation like $x(2+y)$ which would need to be written `x*(2+y)`.
In contrast to those four well-known operators, the symbol used to exponentiate -- raise to a power --
varies among programming languages. R accepts both `**` and `^` (we'll use `**` here), so the Pythagorean theorem $c^2=a^2+b^2$ would be written `c**2 = a**2 + b**2`
except for the fact that it wouldn't make sense as a statement to R.
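As a quick check of these operator rules, here is a minimal sketch using made-up values. The names `nir`, `red`, `side_a`, and `side_b`, and the numbers assigned to them, are assumptions for illustration only.

```{r}
# Hypothetical reflectance values, for illustration only
nir <- 0.5
red <- 0.1
ndvi <- (nir - red) / (nir + red)  # parentheses group the numerator and denominator
ndvi

# Multiplication must be explicit: 3(2 + 4) is an error, 3 * (2 + 4) is not
3 * (2 + 4)

# Exponentiation: ** and ^ give the same result
side_a <- 3
side_b <- 4
side_a**2 + side_b**2
side_a^2 + side_b^2
```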
We'll need to talk about expressions and statements.
## Expressions and Statements
The concepts of expressions and statements are very important to understand in any programming language.
An *expression* in R (or any programming language) has a *value* just like a variable has a value.
An expression will commonly combine variables and functions to be *evaluated* to derive the value
of the expression. Here are some examples of expressions:
```
5
x
x*2
sin(x)
sqrt(a**2 + b**2)
(-b+sqrt(b**2-4*a*c))/(2*a)
paste("My name is", aname)
```
Note that some of those expressions used previously assigned variables -- x, a, b, c, aname.
An expression can be entered in the console to display its current value.
```{r}
cos(pi)
print(cos(pi))
```
A *statement* in R does something. It represents a directive we're assigning to the computer, or
maybe the environment we're running on the computer (like RStudio, which then runs R). A simple
`print()` *statement* seems a lot like what we just did when we entered an expression in the console, but recognize that it *does something*:
```{r}
print("Hello, World")
```
This is the same as just typing `"Hello, World"` in the console, but that's only because the job of the console is to display the value of whatever we enter [where we are the ones *doing something*], or whatever our statement includes to display.
Statements in R are usually put on one line, but you can use a semicolon to have multiple statements on one line, if desired:
```{r}
x <- 5; print(x); print(x**2)
```
Many (perhaps most) statements don't actually display anything. For instance:
```{r}
x <- 5
```
doesn't display anything, but it does assign the value 5 to the variable x, so it *does something*. It's an *assignment statement* and uses that special assignment operator `<-`. Most languages just use `=`, which the designers of R didn't want to use, to avoid confusing it with the equal sign meaning "is equal to".
*An assignment statement assigns the value of an expression to a variable.* If that variable already exists, it is reused with the new value. For instance, it's completely legit (and commonly done in coding) to update the variable in an assignment statement. This is very common when using a counter variable:
```
i = i + 1
```
You're simply updating the index variable with the next value. This also illustrates why it's *not* an equation: $i=i+1$ doesn't work as an equation (unless i is actually $\infty$ but that's just really weird.)
And `c**2 = a**2 + b**2` doesn't make sense as an R statement because `c**2` isn't a variable to be created.
The `**` part is interpreted as *raise to a power*. What is to the left of the assignment operator `=` *must* be a variable to be assigned the value of the expression.
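A minimal sketch of these points, written with `<-`. The variable names (`count`, `side_a`, `side_b`, `c_squared`) and values are hypothetical, chosen only to illustrate the pattern.

```{r}
count <- 0            # create the counter
count <- count + 1    # the right side is evaluated first, then assigned back
count <- count + 1
count

side_a <- 3
side_b <- 4
c_squared <- side_a**2 + side_b**2   # a variable name on the left, an expression on the right
c_squared
```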
## Data Types
Variables, constants and other data elements in R have data types.
Common types are numeric and character.
```{r}
x <- 5
class(x)
class(4.5)
class("Fred")
```
### Integers
By default, R creates double-precision floating-point numeric variables.
To create integer variables:
- append an L to a constant, e.g. `5L` is an integer 5
- convert with `as.integer`
We're going to be looking at various `as.` functions in R, more on that later,
but we should look at `as.integer()` now. Many other languages use `int()` for this.
It converts *any number* into an integer by *truncating* it (dropping the fractional
part), not rounding it.
```{r}
as.integer(5)
as.integer(4.5)
```
To round a number, there's a `round()` function, or for positive numbers you can use `as.integer()` after adding 0.5:
```{r}
x <- 4.8
y <- 4.2
as.integer(x + 0.5)
round(x)
as.integer(y + 0.5)
round(y)
```
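The two approaches do not always agree. The sketch below, with values chosen only for illustration, suggests where they differ: the `as.integer()` shortcut breaks down for negative numbers because truncation goes toward zero, and `round()` sends a value ending in exactly .5 to the nearest even number.

```{r}
neg <- -4.8
as.integer(neg + 0.5)   # truncates -4.3 toward zero, giving -4
round(neg)              # gives -5

half <- 2.5
as.integer(half + 0.5)  # gives 3
round(half)             # gives 2, the nearest even number
```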
Integer division:
```{r}
5 %/% 2
```
Integer remainder from division (the modulus, using `%%` as the modulo operator):
```{r}
5 %% 2
```
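Surprisingly, the values returned by integer division or the remainder here are not stored as integers. That's because the constants `5` and `2` are themselves stored as floating point; with integer operands such as `5L %/% 2L`, the result is an integer.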
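A short sketch comparing double and integer operands, using the `L` suffix introduced above:

```{r}
class(5 %/% 2)      # operands are doubles, so the result is "numeric"
class(5L %/% 2L)    # operands are integers, so the result is "integer"
class(5L %% 2L)     # the same holds for the remainder
```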
## Rectangular data
A common data format used in most types of research is *rectangular* data such as in a spreadsheet,
with rows and columns, where rows might be *observations* and columns might be *variables*.
We'll read this type of data in from spreadsheets or even more commonly from comma-separated values (CSV)
text files, which spreadsheet programs like Excel commonly read in just like their native format.
```{r include=FALSE}
sierraFeb <- read_csv("data/sierraFeb.csv")
```
```{r}
sierraFeb
```
## Data Structures in R
We looked briefly at the numeric and character string data types (we'll abbreviate the latter simply as "string" from here on).
We'll also look at factors and dates/times later on.
### Vectors
A vector is an ordered collection of numbers, strings, vectors, data frames, etc.
What we mostly refer to as vectors are formally called *atomic vectors*, which are required
to be *homogeneous* sets of whatever type we're referring to, such as a vector of numbers,
a vector of strings, or a vector of dates/times.
You can create a simple vector with the `c()` function:
```{r}
lats <- c(37.5,47.4,29.4,33.4)
lats
states <- c("VA", "WA", "TX", "AZ")
states
zips <- c(23173, 98801, 78006, 85001)
zips
```
The class of a vector is the type of data it holds:
```{r}
temp <- c(10.7, 9.7, 7.7, 9.2, 7.3, 6.7)
class(temp)
```
Vectors can only have one data class, and if mixed with character types, numeric elements will become character:
```{r}
mixed <- c(1, "fred", 7)
class(mixed)
mixed[3] # gets a subset, example of coercion
```
#### NA
Data science requires dealing with missing data by storing some sort of null value, called various things:
- null
- nodata
- NA "not available" or "not applicable"
```{r}
as.numeric(c("1","Fred","5")) # note NA introduced by coercion
```
It's common to ignore NA values in statistical summaries. Normally, a summary statistic of data that includes an NA can only return NA...
```{r}
mean(as.numeric(c("1", "Fred", "5")))
```
... with `na.rm=T` you can still get the result for all actual data:
```{r}
mean(as.numeric(c("1", "Fred", "5")), na.rm=T)
```
Don't confuse `NA` with `NaN` ("not a number"), which is what R returns for undefined numeric results like `sqrt(-1)` or `0/0` (explore the help for more on this):
```{r}
is.na(NA)
is.nan(NA)
is.na(as.numeric(''))
is.nan(as.numeric(''))
i <- sqrt(-1)
is.na(i) # interestingly NaN is also NA
is.nan(i)
```
#### Sequences
A sequence is an easy way to make a vector from a range of values. The following 3 examples produce the same values:
```
seq(1,10)
c(1:10)
c(1,2,3,4,5,6,7,8,9,10)
```
The `seq()` function has special uses like using a step parameter (`by`):
```{r}
seq(2,10,2)
```
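A few more variations, as a sketch; `by` and `length.out` are named arguments of the base `seq()` function, and the values are arbitrary.

```{r}
seq(10, 0, by = -2.5)        # a negative step counts down
seq(0, 1, length.out = 5)    # ask for 5 evenly spaced values instead of a step
identical(seq(1, 10), 1:10)  # compare seq() with the colon shorthand
```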
#### Vectorization and vector arithmetic
Arithmetic on vectors operates element-wise:
```{r}
elev <- c(52,394,510,564,725,848,1042,1225,1486,1775,1899,2551)
elevft <- elev / 0.3048
elevft
```
Another example, with 2 vectors:
```{r}
temp03 <- c(13.1,11.4,9.4,10.9,8.9,8.4,6.7,7.6,2.8,1.6,1.2,-2.1)
temp02 <- c(10.7,9.7,7.7,9.2,7.3,6.7,4.0,5.0,0.9,-1.1,-0.8,-4.4)
tempdiff <- temp03 - temp02
tempdiff
```
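To see what "element-wise" saves us, the sketch below compares a vectorized expression with an explicit loop that does the same work. The names `elev_m`, `ft_vectorized`, `ft_loop`, and `k` are made up for this comparison.

```{r}
elev_m <- c(52, 394, 510)            # a few of the elevations above, in meters

# Vectorized: one expression converts every element
ft_vectorized <- elev_m / 0.3048

# Equivalent explicit loop, shown only for comparison
ft_loop <- numeric(length(elev_m))
for (k in seq_along(elev_m)) {
  ft_loop[k] <- elev_m[k] / 0.3048
}

all.equal(ft_vectorized, ft_loop)    # both approaches give the same values
```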
#### Plotting vectors
Vectors of Feb temperature, elevation and latitude at stations in the Sierra:
```{r}
temp <- c(10.7, 9.7, 7.7, 9.2, 7.3, 6.7, 4.0, 5.0, 0.9, -1.1, -0.8, -4.4)
elev <- c(52, 394, 510, 564, 725, 848, 1042, 1225, 1486, 1775, 1899, 2551)
lat <- c(39.52, 38.91, 37.97, 38.70, 39.09, 39.25, 39.94, 37.75, 40.35, 39.33, 39.17, 38.21)
```
**Plot individually**
```{r fig.cap="Temperature"}
plot(temp)
```
```{r fig.cap="Elevation"}
plot(elev)
```
```{r fig.cap="Latitude"}
plot(lat)
```
**Then plot as a scatterplot**
```{r fig.cap="Temperature~Elevation"}
plot(elev,temp)
```
#### Named indices
Vector indices can be named.
```{r}
codes <- c(380, 124, 818)
codes
codes <- c(italy = 380, canada = 124, egypt = 818)
codes
str(codes)
```
Why? I guess so you can refer to observations by name instead of index.
The following are equivalent:
```{r}
codes[1]
codes["italy"]
```
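Names also let you pull out several elements at once or inspect the labels themselves. A small sketch, using a copy of the vector above under the assumed name `country_codes`:

```{r}
country_codes <- c(italy = 380, canada = 124, egypt = 818)
names(country_codes)                   # the names are stored alongside the values
country_codes[c("egypt", "canada")]    # select several elements by name, in any order
unname(country_codes["canada"])        # drop the name to get just the value
```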
### Lists
Lists can be heterogeneous, with multiple class types. Lists are actually used a lot in R, but we won't see them for a while.
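For a first glimpse, here is a minimal sketch of a list holding a string, a number, and a numeric vector together. The `station` object and its contents are hypothetical.

```{r}
# A hypothetical station record mixing types
station <- list(name = "Example Station", elev = 1225, temps = c(5.0, 7.6))
class(station)
station$name              # access an element by name
station[["temps"]][2]     # second value of the temps element
sapply(station, class)    # each element keeps its own class
```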
### Matrices
Vectors are commonly used as a column in a matrix (or as we'll see, a data frame), like a variable.
```{r}
temp <- c(10.7, 9.7, 7.7, 9.2, 7.3, 6.7, 4.0, 5.0, 0.9, -1.1, -0.8, -4.4)
elev <- c(52, 394, 510, 564, 725, 848, 1042, 1225, 1486, 1775, 1899, 2551)
lat <- c(39.52, 38.91, 37.97, 38.70, 39.09, 39.25, 39.94, 37.75, 40.35, 39.33, 39.17, 38.21)
```
**Building a matrix from vectors as columns**
```{r}
sierradata <- cbind(temp, elev, lat)
class(sierradata)
```
#### Dimensions for arrays and matrices
Note: a matrix is just a 2D array. Arrays can also have 1, 3, or more dimensions.
```{r}
dim(sierradata)
```
```{r}
a <- 1:12
dim(a) <- c(3, 4) # matrix
class(a)
dim(a) <- c(2,3,2) # 3D array
class(a)
dim(a) <- 12 # 1D array
class(a)
b <- matrix(1:12, ncol=1) # 1 column matrix is allowed
```
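Once a vector has dimensions, its values fill the matrix column by column and can be picked out by row and column. A short sketch, using a separate object `m34` (an assumed name) so the objects above are left alone:

```{r}
m34 <- 1:12
dim(m34) <- c(3, 4)   # 3 rows, 4 columns, filled down each column in turn
m34[2, 3]             # row 2, column 3
m34[, 1]              # the whole first column
nrow(m34)
ncol(m34)
```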
### Data frames
A data frame is like a database table, with fields (columns, stored as vectors) and records (rows), so it is very important for data analysis and GIS. Data frames are kind of like a spreadsheet with rules (the first row holds field names, and each field is all one type). So even though they're more complex than a list, we use them so frequently they become quite familiar [whereas I continue to find lists confusing, especially when discovering them as what a particular function returns.]
```{r warning=FALSE}
library(palmerpenguins)
data(package = 'palmerpenguins')
head(penguins)
```
**Creating a data frame out of a matrix**
```{r fig.cap="Temperature~Elevation"}
mydata <- as.data.frame(sierradata)
plot(temp ~ elev, data = mydata) # the formula form lets plot() find both variables in mydata
```
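A data frame can also be built directly from vectors with the base `data.frame()` function. The sketch below uses a made-up object name, `sierra_small`, and only the first three values from the vectors above.

```{r}
sierra_small <- data.frame(
  temp = c(10.7, 9.7, 7.7),
  elev = c(52, 394, 510)
)
class(sierra_small)
names(sierra_small)    # field (column) names
sierra_small$elev      # one field, returned as a vector
nrow(sierra_small)     # number of records
```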
**Read a data frame from a CSV**
```{r}
TRI87 <- read_csv("data/TRI_1987_BaySites.csv")
TRI87
```
**Sort, Index, & Max/Min**
```{r}
TRI87 <- read_csv("data/TRI_1987_BaySites.csv")
head(sort(TRI87$air_releases))
index <- order(TRI87$air_releases)
head(TRI87$FACILITY_NAME[index]) # displays facilities in order of their air releases
i_max <- which.max(TRI87$air_releases)
TRI87$FACILITY_NAME[i_max] # was NUMMI at the time
```
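Since the CSV may not be at hand, the same ideas can be tried on two small made-up vectors. The names `site` and `release` and their values are hypothetical.

```{r}
# Hypothetical values, for illustration only
site <- c("A", "B", "C", "D")
release <- c(300, 50, 1200, 700)

sort(release)                 # the values themselves, smallest first
idx <- order(release)         # the positions that would sort the values: 2 1 4 3
site[idx]                     # sites from smallest to largest release
site[which.max(release)]      # site with the largest release
site[which.min(release)]      # site with the smallest release
```

The difference to notice is that `sort()` returns values, while `order()` and `which.max()` return positions that can index a different vector.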
### Factors
Factors are vectors with predefined values:
- Normally used for categorical data.
- Built on an *integer* vector
- Levels are the set of predefined values.
```{r}
fruit <- factor(c("apple", "banana", "orange", "banana"))
fruit # note that levels will be in alphabetical order
class(fruit)
typeof(fruit)
```
An equivalent conversion:
```{r}
fruitint <- c(1, 2, 3, 2) # equivalent conversion
fruit <- factor(fruitint, labels = c("apple", "banana", "orange"))
str(fruit)
```
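To see the integer vector underneath a factor, you can look at its levels and integer codes separately. A sketch, using a separate object `fruit_f` (an assumed name):

```{r}
fruit_f <- factor(c("apple", "banana", "orange", "banana"))
levels(fruit_f)        # the predefined values
as.integer(fruit_f)    # the integer codes that point into the levels: 1 2 3 2
table(fruit_f)         # counts per level
```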
#### Categorical Data and Factors
While character data might be seen as categorical (e.g. "urban", "agricultural", "forest" land covers), to be used as categorical variables they must be made into factors.
```{r}
grain_order <- c("clay", "silt", "sand")
grain_char <- sample(grain_order, 36, replace = TRUE)
grain_fact <- factor(grain_char, levels = grain_order)
grain_char
grain_fact
```
To make a categorical variable a factor:
```{r}
fruit <- c("apples", "oranges", "bananas", "oranges")
farm <- c("organic", "conventional", "organic", "organic")
ag <- as.data.frame(cbind(fruit, farm))
ag$fruit <- factor(ag$fruit)
ag$fruit
```
**Factor example**
```{r}
sierraFeb$COUNTY <- factor(sierraFeb$COUNTY)
str(sierraFeb$COUNTY)
```
## Programming and Logic
Given the exploratory nature of the R language, we sometimes forget that it provides
significant capabilities as a programming language where we can solve more
complex problems by coding procedures and using logic to control the process
and handle a range of possible scenarios.
Programming languages are used for a wide range of purposes, from developing operating
systems built from low-level code to high-level *scripting* used to run existing functions
in libraries. R and Python are commonly used for scripting, and you may be familiar with
using arcpy to script ArcGIS geoprocessing tools. But whether low- or high-level, some common
operational structures are used in all computer programming languages:
- Conditional operations: *If* a condition is true, do this, and maybe otherwise do something *else*.
  `if (x != 0) {print(1/x)} else {print("Can't divide by 0")}`
- Loops
  `for(i in 1:10) print(paste(i, 1/i))`
- Functions (defining your own then using it in your main script)
```{r}
turnright <- function(ang){(ang + 90) %% 360}
turnright(c(260, 270, 280))
```
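The three structures can also be combined. The sketch below is one possible arrangement, not the only one: a hypothetical function `safe_inverse()` holds the conditional, and a loop calls it for a few arbitrary test values.

```{r}
# Hypothetical helper: return 1/x, or NA with a message when x is 0
safe_inverse <- function(x) {
  if (x != 0) {
    1 / x
  } else {
    message("Can't divide by 0")
    NA
  }
}

# Loop over a few test values, printing each value with its inverse
for (val in c(4, 0, -2)) {
  print(paste(val, safe_inverse(val)))
}
```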
**Free-standing scripts**
As we move forward, we'll want to develop complete, free-standing scripts that have all of the needed libraries and data.
Your scripts should stand on their own. One example of this that may seem insignificant is using `print()` statements instead
of just naming the object or variable in the console. While that is common in exploratory work, we need to learn to
create free-standing scripts.
However, "free standing" still allows for loading libraries of functions we'll be using.
We're still talking about high-level (*scripting*), not low-level programming, so we can depend on those libraries that
any user can access by installing those packages. If we develop our own packages, we just need to provide the user the ability to install
those packages.
### Subsetting with logic
We'll use a package that includes data from
Irizarry, Rafael (2020) *Introduction to Data Science* section 2.13.1.
Identify all states with murder rates ≤ that of Italy.
```{r}
library(dslabs)
data(murders)
murder_rate <- murders$total / murders$population * 100000
i <- murder_rate <= 0.71
murders$abb[i]
```
**which**
```{r}
library(readr)
TRI87 <- read_csv("data/TRI_1987_BaySites.csv")
i <- which(TRI87$air_releases > 1e6)
TRI87$FACILITY_NAME[i]
```
**%in%**
```{r}
i <- TRI87$COUNTY %in% c("NAPA","SONOMA")
TRI87$FACILITY_NAME[i]
```
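The three subsetting approaches above can be compared side by side on small made-up vectors that need no package or file. The names `county_name`, `amount`, and `big`, and their values, are hypothetical.

```{r}
# Hypothetical values, for illustration only
county_name <- c("NAPA", "ALAMEDA", "SONOMA", "MARIN")
amount <- c(10, 2500, 40, 900)

big <- amount > 500                          # a logical vector, one TRUE/FALSE per element
big
county_name[big]                             # subset directly with the logical vector
which(big)                                   # positions of the TRUE values: 2 4
county_name %in% c("NAPA", "SONOMA")         # TRUE where the element matches any listed value
amount[county_name %in% c("NAPA", "SONOMA")] # amounts for the matching counties
```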
### Apply functions
There are many apply functions in R, and they largely obviate the need for looping. For instance:
- `apply` derives values at margins of rows and columns, e.g. to sum across rows or down columns
```{r}
# matrix apply – the same would apply to data frames
matrix12 <- 1:12
dim(matrix12) <- c(3,4)
rowsums <- apply(matrix12, 1, sum)
colsums <- apply(matrix12, 2, sum)
sum(rowsums)
sum(colsums)
zero <- sum(rowsums) - sum(colsums)
matrix12
```
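The function passed to `apply()` does not have to be `sum`. A brief sketch on a separate matrix `m_demo` (an assumed name), which also compares the result with the base shortcut `rowSums()`:

```{r}
m_demo <- matrix(1:12, nrow = 3)   # same values and shape as the matrix above
apply(m_demo, 1, sum)              # margin 1 = rows: 22 26 30
rowSums(m_demo)                    # base R shortcut giving the same row totals
apply(m_demo, 2, max)              # margin 2 = columns: 3 6 9 12
```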
Apply functions satisfy one of the needs that spreadsheets are used for. Consider how often you use sum, mean or similar functions in Excel.
**`sapply`**
sapply applies functions to either:
- all elements of a vector – unary functions only
```{r}
sapply(1:12, sqrt)
```
- or all variables of a data frame (not a matrix), where it works much like a column-based apply (since variables are columns) but more easily interpreted without the need of specifying columns with 2:
```{r}
sapply(cars,mean) # same as apply(cars,2,mean)
```
```{r}
temp02 <- c(10.7,9.7,7.7,9.2,7.3,6.7,4.0,5.0,0.9,-1.1,-0.8,-4.4)
temp03 <- c(13.1,11.4,9.4,10.9,8.9,8.4,6.7,7.6,2.8,1.6,1.2,-2.1)
sapply(as.data.frame(cbind(temp02,temp03)),mean) # has to be a data frame
```
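`sapply()` also accepts a function you write on the spot. The sketch below uses made-up values and a hypothetical data frame `df_small`.

```{r}
sapply(1:3, function(v) v**2)      # an anonymous function applied to each element: 1 4 9

df_small <- data.frame(p = c(1, 2, 3), q = c(10, 20, 30))
sapply(df_small, mean)             # one mean per column, named by column
sapply(df_small, class)            # works for any function that takes a column
```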
While various `apply` functions are in base R, the purrr package takes these further.
See: [purrr cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf)
## Exercises
1. Assign variables for your name, city, state and zip code, and use `paste()` to combine them, and assign them to the variable `me`. What is the class of `me`?
2. Knowing that trigonometric functions require angles (including azimuth directions) to be provided in radians, and that degrees can be converted into radians by dividing by 180 and multiplying that by pi, derive the sine of 30 degrees with an R expression. (Base R knows what pi is, so you can just use `pi`)
3. If two sides of a right triangle on a map can be represented as $dX$ and $dY$ and the direct line path between them $c$, and the coordinates of 2 points on a map might be given as $(x1,y1)$ and $(x2,y2)$, with $dX=x2-x1$ and $dY=y2-y1$, use the Pythagorean theorem to derive the distance between them and assign that expression to $c$.
4. You can create a vector of uniform random numbers from 0 to 1 using `runif(n=30)` where n=30 says to make 30 of them. Use the `round()` function to round each of the values, and provide what you created and explain what happened.
5. Create two vectors of 10 numbers each with the `c()` function, assigning them to x and y. Then use `plot(x,y)`, and provide the three lines of code you used to do the assignment and plot.
6. Change your code from #5 so that one value is NA (entered simply as `NA`, no quotation marks), and derive the mean value for x. Then add `,na.rm=T` to the parameters for `mean()`. Also do this for y. Describe your results and explain what happens.
7. Create two sequences, `a` and `b`, with `a` all odd numbers from 1 to 99, `b` all even numbers from 2 to 100. Then derive c through vector division of `b/a`. Plot a and c together as a scatterplot.
8. Build the sierradata data frame from the data at the top of the **Matrices** section, also given here:
```
temp <- c(10.7, 9.7, 7.7, 9.2, 7.3, 6.7, 4.0, 5.0, 0.9, -1.1, -0.8, -4.4)
elev <- c(52, 394, 510, 564, 725, 848, 1042, 1225, 1486, 1775, 1899, 2551)
lat <- c(39.52, 38.91, 37.97, 38.70, 39.09, 39.25, 39.94, 37.75, 40.35, 39.33, 39.17, 38.21)
```
Create a data frame from it using the same steps, and plot temp against latitude.
9. From the `sierradata` matrix built with `cbind()`, derive the column means using `apply()`, with `2` to specify columns and `mean` as the function to apply.
10. Do the same thing with the `sierradata` data frame using `sapply()`.