01-intro.Rmd

---
output:
  pdf_document: default
  html_document: default
---
```{r include=FALSE}
library(tidyverse)
```

# Introduction to R

We're assuming you're either new to R or need a refresher.

We'll start with some basic R operations entered directly in the console in RStudio.

## Variables
Variables are objects that store values. Every computer language, like in math, stores
values by assigning them constants or results of expressions.
`x <- 5` uses the R standard assignment operator `<-` though you can also use `=`. 
We'll use `<-` because it is more common and avoids some confusion with other syntax.
```

```
Variable names must start with a letter, have no spaces, and not use any names 
that are built into the R language or used in package libraries, such as
reserved words like `for` or function names like `log()`
```{r}
x <- 5
y <- 8
longitude <- -122.4
latitude <- 37.8
my_name <- "Inigo Montoya"
```
To check the value of a variable or other object, you can just enter the name in 
the console, or even in the code in a code chunk. 
```{r}
x
y
longitude
latitude
my_name


```
This is counter to the way printing out values work in programming, and you will
need to know how this method works as well because you will want to use your code
to develop tools that accomplish things, and there are also limitations to what you
can see by just naming variables.

To see the values of variables in programming mode, use the `print()` function, 
or to concatenate character string output, use `paste()`: 
```{r}
print(x)
print(y)
print(latitude)
paste("The location is latitude", latitude, "longitude", longitude)
paste("My name is", my_name, "-- Prepare to die.")

```
## Functions
Once you have variables or other objects to work with, most of your work 
involves *functions* such as the well-known math functions
```
log10(100)
log(exp(5))
cos(pi)
sin(90 * pi/180)
```
Most of your work will involve functions and there are too many to name, 
even in the base functions, not to mention all the packages we will want to use. 
You will likely have already used the `install.packages()` and `library()` functions 
that add in an array of other functions.
Later we'll also learn how to write our own functions, a capability that is easy to
accomplish and also gives you a sense of what developing your own package might be like.

**Arithmetic operators**
There are of course all the normal arithmetic operators (that are actually functions)
like + - * /.  You're probably familiar with these from using equations in Excel if not 
in some other programming language you may have learned. These operators look a bit different
from how they'd look when creating a nicely formatted equation.$\frac{NIR - R}{NIR + R}$
instead has to look like `(NIR-R)/(NIR+R)`.  Similarly `*` *must* be used to multiply; there's no implied multiplication 
that we expect in a math equation like $x(2+y)$ which would need to be written `x*(2+y)`.

In contrast to those four well-known operators, the symbol used to exponentiate -- raise to a power -- 
varies among programming languages. R uses ** so the the Pythagorean theorem $c^2=a^2+b^2$ would be written `c**2 = a**2 + b**2` 
except for the fact that it wouldn't make sense as a statement to R. 
We'll need to talk about expressions and statements.

## Expressions and Statements

The concepts of expressions and statements are very important to understand in any programming language.

An *expression* in R (or any programming language) has a *value* just like a variable has a value.
An expression will commonly combine variables and functions to be *evaluated* to derive the value
of the expression. Here are some examples of expressions:
```
5
x
x*2
sin(x)
sqrt(a**2 + b**2)
(-b+sqrt(b**2-4*a*c))/2*a
paste("My name is", aname)
```

Note that some of those expressions used previously assigned variables -- x, a, b, c, aname. 

An expression can be entered in the console to display its current value.
```{r}
cos(pi)
print(cos(pi))
```

A *statement* in R does something. It represents a directive we're assigning to the computer, or
maybe the environment we're running on the computer (like RStudio, which then runs R). A simple
`print()` *statement* seems a lot like what we just did when we entered an expression in the console, but recognize that it *does something*:
```{r}
print("Hello, World")
```
Which is the same as just typing "Hello, World", but that's just because the job of the console is to display what we are looking for [where we are the ones *doing something*], or if our statement includes something to display.

Statements in R are usually put on one line, but you can use a semicolon to have multiple statements on one line, if desired:

```{r}
x <- 5; print(x); print(x**2)
```

Many (perhaps most) statements don't actually display anything. For instance:
```{r}
x <- 5
```
doesn't display anything, but it does assign the value 5 to the variable x, so it *does something*. It's an *assignment statement* and uses that special assignment operator `<-` .  Most languages just use `=` which the designers of R didn't want to use, to avoid confusing it with the equal sign meaning "is equal to". 

*An assignment statement assigns an expression to a variable.* If that variable already exists, it is reused with the new value. For instance it's completely legit (and commonly done in coding) to update the variable in an assignment statement.  This is very common when using a counter variable:
```
i = i + 1
```
You're simply updating the index variable with the next value. This also illustrates why it's *not* an equation:  $i=i+1$ doesn't work as an equation (unless i is actually $\infty$ but that's just really weird.)

And `c**2 = a**2 + b**2` doesn't make sense as an R statement because `c**2` isn't a variable to be created. 
The `**` part is interpreted as *raise to a power*.  What is to the left of the assignment operator `=` *must* be a variable to be assigned the value of the expression.

## Data Types
Variables, constants and other data elements in R have data types.
Common types are numeric and character.
```{r}
x <- 5
class(x)
class(4.5)
class("Fred")
```
### Integers
By default, R creates double-precision floating-point numeric variables 
To create integer variables:
- append an L to a constant, e.g. `5L` is an integer 5
- convert with `as.integer`
We're going to be looking at various `as.` functions in R, more on that later, 
but we should look at `as.integer()` now.  Most other languages use `int()` for this,
and what it does is converts *any number* into an integer, *truncating* it to an
integer, not rounding it. 

```{r}
as.integer(5)
as.integer(4.5)
```
To round a number, there's a `round()` function or you can easily use `as.integer` adding 0.5:
```{r}
x <- 4.8
y <- 4.2
as.integer(x + 0.5)
round(x)
as.integer(y + 0.5)
round(y)
```


Integer divison:
```{r}
5 %/% 2
```
Integer remainder from division (the modulus, using a `%%` to represent the modulo):
```{r}
5 %% 2
```
Surprisingly, the values returned by integer division or the remainder are not stored as integers.  R seems to prefer floating point...

## Rectangular data
A common data format used in most types of research is *rectangular* data such as in a spreadsheet,
with rows and columns, where rows might be *observations* and columns might be *variables*.
We'll read this type of data in from spreadsheets or even more commonly from comma-separated-variable (CSV)
text files that spreadsheet programs like Excel commonly read in just like their native format.
```{r include=FALSE}
sierraFeb <- read_csv("data/sierraFeb.csv")
```
```{r}
sierraFeb
```
## Data Structures in R
We looked briefly at numeric and character string (we'll abbreviate simply as "string" from here on).
We'll also look at factors and dates/times later on.

### Vectors
A vector is an ordered collection of numbers, strings, vectors, data frames, etc.
What we mostly refer to as vectors are formally called *atomic vectors* which requires
that they be *homogeneous* sets of whatever type we're referring to, such as a vector of numbers, 
or a vector of strings, or a vector of dates/times.

You can create a simple vector with the `c()` function:
```{r}
lats <- c(37.5,47.4,29.4,33.4)
lats
states = c("VA", "WA", "TX", "AZ")
states
zips = c(23173, 98801, 78006, 85001)
zips
```
The class of a vector is the type of data it holds

```{r}

```


```{r}
temp <- c(10.7, 9.7, 7.7, 9.2, 7.3, 6.7)
class(temp)
```
Vectors can only have one data class, and if mixed with character types, numeric elements will become character:
```{r}
mixed <- c(1, "fred", 7)
class(mixed)
mixed[3]   # gets a subset, example of coercion
```

#### NA
Data science requires dealing with missing data by storing some sort of null value, called various things:
- null
- nodata
- NA "not available" or "not applicable"
```{r}
as.numeric(c("1","Fred","5")) # note NA introduced by coercion
```
Ignoring NA in statistical summaries is commonly used. Where normally the summary statistic can only return NA...
```{r}
mean(as.numeric(c("1", "Fred", "5")))
```
... with `na.rm=T` you can still get the result for all actual data:
```{r}
mean(as.numeric(c("1", "Fred", "5")), na.rm=T)
```
Don't confuse with `nan` ("not a number") which is used for things like imaginary numbers (explore the help for more on this)

```{r}
is.na(NA)
is.nan(NA)
is.na(as.numeric(''))
is.nan(as.numeric(''))
i <- sqrt(-1)
is.na(i) # interestingly nan is also na
is.nan(i)
```
#### Sequences
An easy way to make a vector from a sequence of values.  The following 3 examples are equivalent:
```
seq(1,10)
c(1:10)
c(1,2,3,4,5,6,7,8,9,10)
```
The seq() function has special uses like using a step parameter:
```{r}
seq(2,10,2)
```
#### Vectorization and vector arithmetic
Arithmetic on vectors operates element-wise
```{r}
elev <- c(52,394,510,564,725,848,1042,1225,1486,1775,1899,2551)
elevft <- elev / 0.3048
elevft

```
Another example, with 2 vectors:
```{r}
temp03 <- c(13.1,11.4,9.4,10.9,8.9,8.4,6.7,7.6,2.8,1.6,1.2,-2.1)
temp02 <- c(10.7,9.7,7.7,9.2,7.3,6.7,4.0,5.0,0.9,-1.1,-0.8,-4.4)
tempdiff <- temp03 - temp02
tempdiff

```
#### Plotting vectors
Vectors of Feb temperature, elevation and latitude at stations in the Sierra:
```{r}
temp <- c(10.7, 9.7, 7.7, 9.2, 7.3, 6.7, 4.0, 5.0, 0.9, -1.1, -0.8, -4.4)
elev <- c(52, 394, 510, 564, 725, 848, 1042, 1225, 1486, 1775, 1899, 2551)
lat <- c(39.52, 38.91, 37.97, 38.70, 39.09, 39.25, 39.94, 37.75, 40.35, 39.33, 39.17, 38.21)

```

**Plot individually**
```{r fig.cap="Temperature"}
plot(temp)
```
```{r fig.cap="Elevation"}
plot(elev)
```
```{r fig.cap="Latitude"}
plot(lat)
```

**Then plot as a scatterplot**
```{r fig.cap="Temperature~Elevation"}
plot(elev,temp)
```
#### Named indices
Vector indices can be named.
```{r}
codes <- c(380, 124, 818)
codes
codes <- c(italy = 380, canada = 124, egypt = 818)
codes
str(codes)
```
Why?  I guess so you can refer to observations by name instead of index. 
The following are equivalent:
```{r}
codes[1]
codes["italy"]
```

### Lists
Lists can be heterogeneous, with multiple class types. Lists are actually used a lot in R, but we won't see them for a while.

### Matrices
Vectors are commonly used as a column in a matrix (or as we'll see, a data frame), like a variable
```{r}
temp <- c(10.7, 9.7, 7.7, 9.2, 7.3, 6.7, 4.0, 5.0, 0.9, -1.1, -0.8, -4.4)
elev <- c(52, 394, 510, 564, 725, 848, 1042, 1225, 1486, 1775, 1899, 2551)
lat <- c(39.52, 38.91, 37.97, 38.70, 39.09, 39.25, 39.94, 37.75, 40.35, 39.33, 39.17, 38.21)
```
**Building a matrix from vectors as columns**
```{r}
sierradata <- cbind(temp, elev, lat)
class(sierradata)
```

#### Dimensions for arrays and matrices
Note:  a matrix is just a 2D array.  Arrays have 1, 3, or more dimensions.
```{r}
dim(sierradata)
```

```{r}
a <- 1:12
dim(a) <- c(3, 4)   # matrix
class(a)
dim(a) <- c(2,3,2)  # 3D array
class(a)
dim(a) <- 12        # 1D array
class(a)
b <- matrix(1:12, ncol=1)  # 1 column matrix is allowed

```
### Data frames
A data frame is a database with fields (as vectors) with records (rows), so is very important for data analysis and GIS.  They're kind of like a spreadsheet with rules (first row is field names, fields all one type). So even though they're more complex than a list, we use them so frequently they become quite familiar [whereas I continue to find lists confusing, especially when discovering them as what a particular function returns.]

```{r warning=FALSE}
library(palmerpenguins)
data(package = 'palmerpenguins')
head(penguins)
```
**Creating a data frame out of a matrix**
```{r fig.cap="Temperature~Elevation"}
mydata <- as.data.frame(sierradata)
plot(data = mydata, x = elev, y = temp)
```

**Read a data frame from a CSV**
```{r}
TRI87 <- read_csv("data/TRI_1987_BaySites.csv")
TRI87
```
**Sort, Index, & Max/Min**
```{r}
TRI87 <- read_csv("data/TRI_1987_BaySites.csv")
head(sort(TRI87$air_releases))
index <- order(TRI87$air_releases)
head(TRI87$FACILITY_NAME[index])   # displays facilities in order of their air releases
i_max <- which.max(TRI87$air_releases)
TRI87$FACILITY_NAME[i_max]   # was NUMMI at the time
```

### Factors

Factors are vectors with predefined values
- Normally used for categorical data.
- Built on an *integer* vector
- Levels are the set of predefined values.

```{r}
fruit <- factor(c("apple", "banana", "orange", "banana"))
fruit   # note that levels will be in alphabetical order
class(fruit)
typeof(fruit)
```

An equivalent conversion:

```{r}
fruitint <- c(1, 2, 3, 2) # equivalent conversion
fruit <- factor(fruitint, labels = c("apple", "banana", "orange"))
str(fruit)

```

#### Categorical Data and Factors

While character data might be seen as categorical (e.g. "urban", "agricultural", "forest" land covers), to be used as categorical variables they must be made into factors.

```{r}
grain_order <- c("clay", "silt", "sand")
grain_char <- sample(grain_order, 36, replace = TRUE)
grain_fact <- factor(grain_char, levels = grain_order)
grain_char
grain_fact
```

To make a categorical variable a factor:
```{r}
fruit <- c("apples", "oranges", "bananas", "oranges")
farm <- c("organic", "conventional", "organic", "organic")
ag <- as.data.frame(cbind(fruit, farm))
ag$fruit <- factor(ag$fruit)
ag$fruit
```
**Factor example**
```{r}
sierraFeb$COUNTY <- factor(sierraFeb$COUNTY)
str(sierraFeb$COUNTY)
```

## Programming and Logic

Given the exploratory nature of the R language, we sometimes forget that it provides
significant capabilities as a programming language where we can solve more 
complex problems by coding procedures and using logic to control the process
and handle a range of possible scenarios.

Programming languages are used for a wide range of purposes, from developing operating
systems built from low-level code to high-level *scripting* used to run existing functions
in libraries. R and Python are commonly used for scripting, and you may be familiar with
using arcpy to script ArcGIS geoprocessing tools. But whether low- or high-level, some common
operational structures are used in all computer programming languages:

- Conditional operations: *If* a condition is true, do this, and maybe otherwise do something *else*.

  `if x!=0 {print(1/x)} else {print("Can't divide by 0")}`

- Loops

  `for(i in 1:10) print(paste(i, 1/i))`

- Functions (defining your own then using it in your main script)

```{r}
turnright <- function(ang){(ang + 90) %% 360}
turnright(c(260, 270, 280))
```

**Free-standing scripts**

As we move forward, we'll be wanting to develop complete, free-standing scripts that have all of the needed libraries and data.
Your scripts should stand on their own. One example of this that may seem insignificant is using print() statements instead
of just naming the object or variable in the console. While that is common in exploratory work, we need to learn to 
create free-standing scripts.  

However, "free standing" still allows for loading libraries of functions we'll be using. 
We're still talking about high-level (*scripting*), not low-level programming, so we can depend on those libraries that 
any user can access by installing those packages. If we develop our own packages, we just need to provide the user the ability to install
those packages. 

### Subsetting with logic

We'll use a package that includes data from 
Irizarry, Rafael (2020) *Introduction to Data Science* section 2.13.1.

Identify all states with murder rates ≤ that of Italy.


```{r}
library(dslabs)
data(murders)
murder_rate <- murders$total / murders$population * 100000
i <- murder_rate <= 0.71 
murders$abb[i]
```
**which**

```{r}
library(readr)
TRI87 <- read_csv("data/TRI_1987_BaySites.csv")
i <- which(TRI87$air_releases > 1e6)
TRI87$FACILITY_NAME[i]
```

**%in%**
```{r}
i <- TRI87$COUNTY %in% c("NAPA","SONOMA")
TRI87$FACILITY_NAME[i]

```

### Apply functions

There are many apply functions in R, and they largely obviate the need for looping.  For instance: 

- `apply` derives values at margins of rows and columns, e.g. to sum across rows or down columns
```{r}
# matrix apply – the same would apply to data frames
matrix12 <- 1:12
dim(matrix12) <- c(3,4)
rowsums <- apply(matrix12, 1, sum)
colsums <- apply(matrix12, 2, sum)
sum(rowsums)
sum(colsums)
zero <- sum(rowsums) - sum(colsums)
matrix12
```

Apply functions satisfy one of the needs that spreadsheets are used for.  Consider how of ten you use sum, mean or similar  functions in Excel.

**`sapply`**

sapply applies functions to either:

- all elements of a vector – unary functions only
```{r}
sapply(1:12, sqrt)
```

- or all variables of a data frame (not a matrix), where it works much like a column-based apply (since variables are columns) but more easily interpreted without the need of specifying columns with 2:

```{r}
sapply(cars,mean)  # same as apply(cars,2,mean)
```

```{r}
temp02 <- c(10.7,9.7,7.7,9.2,7.3,6.7,4.0,5.0,0.9,-1.1,-0.8,-4.4)
temp03 <- c(13.1,11.4,9.4,10.9,8.9,8.4,6.7,7.6,2.8,1.6,1.2,-2.1)
sapply(as.data.frame(cbind(temp02,temp03)),mean) # has to be a data frame
```

While various `apply` functions are in base R, the purrr package takes these further.  
See:  <a href="https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf">purrr cheat sheet</a>


## Exercises

1. Assign variables for your name, city, state and zip code, and use `paste()` to combine them, and assign them to the variable `me`. What is the class of `me`?

2. Knowing that trigonometric functions require angles (including azimuth directions) to be provided in radians, and that degrees can be converted into radians by dividing by 180 and multiplying that by pi, derive the sine of 30 degrees with an R expression.  (Base R knows what pi is, so you can just use `pi`)

3. If two sides of a right triangle on a map can be represented as $dX$ and $dY$ and the direct line path between them $c$, and the coordinates of 2 points on a map might be given as $(x1,y1)$ and $(x2,y2)$, with $dX=x2-x1$ and $dY=y2-y1$, use the Pythagorean theorem to derive the distance between them and assign that expression to $c$.

4. You can create a vector uniform random numbers from 0 to 1 using `runif(n=30)` where n=30 says to make 30 of them. Use the `round()` function to round each of the values, and provide what you created and explain what happened.

5. Create two vectors of 10 numbers each with the c() function, then assigning to x and y. Then plot(x,y), and provide the three lines of code you used to do the assignment and plot.

6. Change your code from #5 so that one value is NA (entered simply as `NA`, no quotation marks), and derive the mean value for x.  Then add `,na.rm=T` to the parameters for `mean()`. Also do this for y.  Describe your results and explain what happens.

7. Create two sequences, `a` and `b`, with `a` all odd numbers from 1 to 99, `b` all even numbers from 2 to 100. Then derive c through vector division of `b/a`.  Plot a and c together as a scatterplot.

8. Build the sierradata data frame from the data at the top of the **Matrices** section, also given here:
```
temp <- c(10.7, 9.7, 7.7, 9.2, 7.3, 6.7, 4.0, 5.0, 0.9, -1.1, -0.8, -4.4)
elev <- c(52, 394, 510, 564, 725, 848, 1042, 1225, 1486, 1775, 1899, 2551)
lat <- c(39.52, 38.91, 37.97, 38.70, 39.09, 39.25, 39.94, 37.75, 40.35, 39.33, 39.17, 38.21)
```
Create a data frame from it using the same steps, and plot temp against latitude.

9. From the `sierradata` matrix built with `cbind()`, derive colmeans using the `mean` parameter on the columns `2` for `apply()`.

10. Do the same thing with the sierra data data frame with `sapply()`.