Gaston Sanchez
- Work with vectors of different data types
- Understand the concept of atomic structures
- Learn how to subset and slice R vectors
- Understand the concept of vectorization
- Understand recycling rules in R
In this lab, you are going to work with a handful of variables about NBA players from the regular season 2016-2017:
player
: name of the player.team
: team name abbreviation.position
: player position.salary
: salary (in dollars).points
: total scored points.points1
: number of free throws, worth 1 point each.points2
: number of 2-point field goals, worth 2 points each.points3
: number of 3-point field goals, worth 3 points each.
The data is in the file nba2017-salary-points.RData
, located in the data/
folder of this github repository. The original source of the data is the website www.basketball-reference.com
Open a new session in Rstudio, and make sure you have a clean workspace by typing this command on the console:
# remove existing objects
rm(list = ls())
You can download the .RData
file to your working directory, and then load()
it with the code below. Do NOT include these commands in your source Rmd file; simply type them directly on the console:
# download RData file into your working directory
rdata <- "https://github.com/ucb-stat133/stat133-fall-2017/raw/master/data/nba2017-salary-points.RData"
download.file(url = rdata, destfile = 'nba2017-salary-points.RData')
The function download.file()
allows you to download any type of file from the Web. In this case you are downloading the file called nba2017-salary-points.RData
which is located in the github repository of the course. This file is a binary file. To be more precise, the file extension .RData
is the default extension used by R for its binary native format.
Where does the file get downloaded? Bby default, the file nba2017-salary-points.RData
gets downloaded to your working directory. If you are curious about what is the current directory to which R is paying attention to, simply type the function getwd()
---which stands for get the working directory.
If you want to specify a specific location for the downloaded file, then modify the destfile
parameter. For instance, if you are using Mac, and you want the file to be downloaded to your desktop, you can use:
# download RData file to your Desktop (assuming you use Mac)
rdata <- "https://github.com/ucb-stat133/stat133-fall-2017/raw/master/data/nba2017-salary-points.RData"
download.file(url = rdata, destfile = '~/Desktop/nba2017-salary-points.RData')
To load or import the contents of the binary file into your R session you use load()
. This function allows you to import R binary files. This time, include the code below in your Rmd
file:
# load data in your R session
load('nba2017-salary-points.RData')
Note: the code above will only work as long as your Rmd
file lives in the same directory of the .RData
file. So far I'm assuming that you have both files in the working directory. If you run into problems, ask the GSI or any of the lab assistants.
Once you imported (or loaded) the data, use the function ls()
which allows you to list all the available R objects:
# list the available objects with ls()
ls()
Check that the following objects are available in your session:
player
team
position
salary
points
Most of the data sets you are going to be working with in real life are going to be stored in some sort of text file. So you probably won't be handling many R binary files (i.e. .RData
files). However, R binary files are an interesting option for saving intermediate results, and/or for saving R objects (as R objects).
Once you have some data objects to work with, the first step is to inspect some of their characteristics. R has various functions that allow you to examine objects:
typeof()
type of storage of any objectclass()
gives you the class of the objectstr()
displays the structure of an object in a compact waymode()
gives the data type (as used in R)object.size()
gives an estimate of the memory space used by an objectlength()
gives the length (i.e. number of elements)head()
take a peek at the first elementstail()
take a peek at the last elementssummary()
shows a summary of a given object
Your turn. Find out what is the class of each of the objects player
, team
, etc:
# your code here
Now use length()
, head()
, tail()
, and summary()
to start exploring the content of the objects:
# your code here
Do all the objects have the same length?
Vectors are the most basic type of data structures in R. Learning how to manipulate data structures in R requires you to start learning how to manipulate vectors.
How do you know if any of the loaded objects is a vector? Using class()
does not explicitly tell you whether an R object is a vector or not. A more explicit way to check if player
or position
are vectors is with the function is.vector()
.
# check whether the loaded objects are vectors
Remember that R vectors are atomic structures, which is just the fancy name to indicate that all the elements of a vector must be of the same type, either all numbers, all characters, or all logical values. To test if an object is atomic, i.e. has all its elements of the same type, use is.atomic()
# check whether the loaded objects are atomic objects
How do you know that a given vector is of a certain data type? One function to answer this question is typeof()
typeof(player)
typeof(position)
typeof(points)
You should know that among the R community, most useRs don't talk about types. Instead, because of historical reasons related to the S language, we talk about modes.
Subsetting refers to extracting elements of a vector (or another R object). To do so, you use what is known as bracket notation. This implies using (square) brackets [ ]
to get access to the elements of a vector, for instance:
# first element of 'player'
player[1]
# first five elements of 'player'
player[1:5]
What type of things can you specify inside the brackets? Basically:
- numeric vectors
- logical vectors (the length of the logical vector must match the length of the vector to be subset)
- character vectors (if the elements have names)
Here are some subsetting examples using a numeric vector inside the brackets:
# fifth element of 'player'
player[5]
# numeric range
player[2:8]
# numeric vector
player[c(1, 3, 5, 7)]
# different order
player[c(20, 9, 10, 50)]
# third element (four times)
player[rep(3, 4)]
Subset the first four players:
four <- player[1:4]
Now try these. What happens if you specify:
- an index of zero:
four[0]
? - a negative index:
four[-1]
? - various negative indices:
four[-c(1,2,3)]
? - an index greater than the length of the vector:
four[5]
? - repeated indices:
four[c(1,2,2,3,3,3)]
?
Often, you will need to generate vectors of numeric sequences, like the first five elements 1:5
, or from the first till the last element 1:length(player)
. R provides the colon operator :
, and the functions seq()
, and rep()
to create various types of sequences.
Figure out how to use seq()
, rep()
to extract:
- all the even elements in
player
- all the odd elements in
salary
- all multiples of 5 (e.g. 5, 10, 15, etc) of
team
- elements in positions 10, 20, 30, 40, etc of
points
- all the even elements in
team
but this time in reverse order
Logical subsetting involves using a logical vector inside the brackets. Learning about logical subsetting is a fundamental survival skill. This kind of subsetting is very powerful because it allows you to extract elements based on some logical condition.
Here's a toy example of logical subsetting:
# dummy vector
a <- c(5, 6, 7, 8)
# logical subsetting
a[c(TRUE, FALSE, TRUE, FALSE)]
## [1] 5 7
Logical subsetting occurs when the vector of indices that you pass inside the brackets is a logical vector.
To do logical subsetting, the vector that you put inside the brackets, should match the length of the manipulated vector. If you pass a shorter vector inside brackets, R will apply its recycling rules.
Notice that the elements of the vector that are subset are those which match the logical value TRUE
.
# your turn
four[c(TRUE, TRUE, TRUE, TRUE)]
four[c(TRUE, TRUE, FALSE, FALSE)]
four[c(FALSE, FALSE, TRUE, TRUE)]
four[c(TRUE, FALSE, TRUE, FALSE)]
four[c(FALSE, FALSE, FALSE, FALSE)]
# recycling
four[TRUE]
four[c(TRUE, FALSE)]
When subsetting a vector logically, most of the times you won't really be providing an explicit vector of TRUE
's and FALSE
s. Just imagine having a vector of 100 or 1000 or 1000000 elements, and trying to do logical subsetting by manually creating a logical vector of the same length. That would be very boring. Instead, you will be providing a logical condition or a comparison operation that returns a logical vector.
A comparison operation occurs when you use comparison operators such as:
>
greater than>=
greater than or equal<
less than<=
less than or equal==
equal!=
different
# elements greater than 6
a > 6
## [1] FALSE FALSE TRUE TRUE
# elements different from 6
a != 6
## [1] TRUE FALSE TRUE TRUE
Notice that a comparison operation always returns a logical vector. Here's how you actually extract the values of a vector based on logical indexing:
points_four <- points[1:4]
# elements greater than 100
points_four[points_four > 100]
## [1] 952 520 894
# elements less than 100
points_four[points_four < 100]
## [1] 10
# elements less than or equal to 10
points_four[points_four <= 10]
## [1] 10
# elements different from 10
points_four[points_four != 10]
## [1] 952 520 894
In addition to using comparison operators, you can also use logical operators to produce a logical vector. The most common type of logical operators are:
&
AND|
OR!
negation
Run the following commands to see what R does:
# AND
TRUE & TRUE
TRUE & FALSE
FALSE & FALSE
# OR
TRUE | TRUE
TRUE | FALSE
FALSE | FALSE
# NOT
!TRUE
!FALSE
Logical operators allow you to combine several comparisons:
# players of Golden State (GSW)
player[team == 'GSW']
# name of players with salaries greater than 20 million dollars
player[salary > 20000000]
# name of players with scored points between 1000 and 1200 (exclusive)
player[points > 1000 & points < 1200]
Write commands, using bracket notation, to answer the following questions (you may need to use min()
, max()
, which()
, which.min()
, which.max()
):
# players in position Center, of Warriors (GSW)
# players of both GSW (warriors) and LAL (lakers)
# players in positions Shooting Guard and Point Guards, of Lakers (LAL)
# subset Small Forwards of GSW and LAL
# name of the player with largest salary
# name of the player with smallest salary
# name of the player with largest number of points
# salary of the player with largest number of points
# largest salary of all Centers
# team of the player with the largest number of points
# name of the player with the largest number of 3-pointers
A third type of subsetting involves passing a character vector inside brackets. When you do this, the characters are supposed to be names of the manipulated vector.
None of the vectors player
, position
, points
, and salary
have names. You can confirm that with the names()
function applied on any of the vectors:
names(player)
Create a vector warriors_player
by selecting the players of team == 'GSW'
; then create a vector warriors_salary
with their salaries, and another vector warriors_points
with their points
. Assign warriors_player
as the names of warriors_salary
.
# create warriors_player
# create warriors_salary
# create warriors_points
You should have a vector warriors_salary
with named elements. Now you can use character subsetting:
warriors_salary["Andre Iguodala"]
warriors_salary[c("Stephen Curry", "Kevin Durant")]
Use the function plot()
to make a scatterplot of points
and salary
plot(points, salary)
Keep in mind that plot()
is a generic function. This means that the behavior of plot()
depends on the type of input. When you pass two numeric vectors, plot()
will attempt to create a scatter plot.
Looking at the generated plot, can you see any issues?
To get a better display of the scatterplot, let's create two vectors log_points
and log_salary
by transforming points
and salary
with the logarithm function log()
log_points <- log(points)
log_salary <- log(salary)
Make another scatterplot but now use the log-transformed vectors:
plot(log_points, log_salary)
To add the names of the players in the plot, you can use the low-level graphing function text()
:
plot(log_points, log_salary)
text(log_points, log_salary, labels = player)
Now we have another problem. The labels in the plot are very messy. A quick and dirty fix is to use abbreviate()
to shorten the displayed names:
plot(log_points, log_salary)
text(log_points, log_salary, labels = abbreviate(player))
Your Turn: create a scatterplot of points and salary for the Warriors (GSW), displaying the names of the players. Generate two scatterplots, one with raw values (original scale, and another plot with log-transformations).
When you create the vectors log_points <- log(points)
and log_salary <- log(salary)
, what you're doing is applying a function to a vector, which in turn acts on all elements of the vector.
This is called Vectorization in R parlance. Most functions that operate with vectors in R are vectorized functions. This means that an action is applied to all elements of the vector without the need to explicitly type commands to traverse all the elements.
In many other programming languages, you would have to use a set of commands to loop over each element of a vector (or list of numbers) to transform them. But not in R.
Another example of vectorization would be the calculation of the square root of all the elements in points
and salary
:
sqrt(points)
sqrt(salary)
If you are new to programming, learning about R's vectorization will be very natural (you won't stop to think about it too much). If you have some previous programming experience in other languages (e.g. C, python, perl), you know that vectorization does not tend to be a native thing.
Vectorization is essential in R. It saves you from typing many lines of code, and you will exploit vectorization with other useful functions known as the apply family functions (we'll talk about them later in the course).
Closely related with the concept of vectorization we have the notion of Recycling. To explain recycling let's see an example.
salary
is given in dollars, but what if you need to obtain the salaries in millions of dollars?. Create a new vector salary_millions
with the converted values in millions of dollars.
# your code here
Take the values in height
, given in inches, to create a new vector height_cms
that shows the values in centimeters; use the conversion of 1 inch = 2.54 centimeters.
# your code here
What you just did (assuming that you did things correctly) is called Recycling. To understand this concept, you need to remember that R does not have a data structure for scalars (single numbers). Scalars are in reality vectors of length 1.
Converting square kms to square miles requires this operation: area * 0.836
. Although it may not be obvious, we are multiplying two vectors: area
and 0.836
. Moreover (and more important) we are multiplying two vectors of different lengths!. So how does R know what to do in this cases?
Well, R uses the recycling rule, which takes the shorter vector (in this case 0.386
) and recycles its elements to form a temporary vector that matches the length of the longer vector (i.e. area
).
Here's another example of recycling. Heights of elements in an odd number position will be transformed to cms; heights of elements in an even number position will be transformed to meters:
units <- c(2.54, 0.0254)
mix_height <- height * units
The elements of units
are recycled and repeated as many times as elements in height
. The previous command is equivalent to this:
new_units <- rep(c(2.54, 0.0254), length.out = length(height))
height * new_units
As mentioned before, vectors are the most essential type of data structure in R. They are atomic structures (can contain only one type of data): integers, real numbers, logical values, characters, complex numbers.
Related to vectors, there is another important data structure in R called factor. Factors are data structures exclusively designed to handle categorical data.
The object team
is an R factor. You can confirm this by using is.factor()
or class()
is.factor(team)
## [1] TRUE
Use factor()
to create an object position_fac
by converting position
into a factor:
position_fac <- factor(position)
If you have a factor, you can invoke table()
to get a table with the frequencies (i.e. counts) of the factor categories or levels:
table(position_fac)
## position_fac
## C PF PG SF SG
## 89 89 85 83 95
Because factors are internally stored as integers, you can manipulate factors as any other vector:
position_fac[1:5]
## [1] C PF SG PG SF
## Levels: C PF PG SF SG
Practice manipulating position_fac
to get:
# positions of Warriors
# positions of players with salaries > 15 millions
# frequencies (counts) of positions with salaries > 15 millions
# relative frequencies (proportions) of 'SG' (Shooting Guards) in each team
#
Let's go back to the scatterplot of points
and salary
plot(points, salary)
But now use your factor position_fac
to add some color to the dots in the scatterplot. Pass the factor to the col =
parameter inside plot()
# your colored scatterplot
Experiment with other plot()
arguments like the point character pch =
, the size of dots with the parameter cex =
, the axes labels xlab
and ylab
and so on.
# class of objects
class(player)
## [1] "character"
class(team)
## [1] "factor"
class(position)
## [1] "character"
class(salary)
## [1] "numeric"
class(points)
## [1] "integer"
# check whether the loaded objects are vectors
is.vector(player)
## [1] TRUE
is.vector(team)
## [1] FALSE
is.vector(position)
## [1] TRUE
is.vector(salary)
## [1] TRUE
# all the even elements in `player`
player_even <- player[seq(from = 1, to = length(player), by = 2)]
# all the odd elements in `salary`
salary_odd <- salary[seq(from = 2, to = length(player), by = 2)]
# all multiples of 5 (e.g. 5, 10, 15, etc) of `team`
teams_by_5 <- team[seq(from = 5, to = length(player), by = 5)]
# elements in positions 10, 20, 30, 40, etc of `points`
points_by_10 <- points[seq(from = 10, to = length(player), by = 10)]
# centers of Warriors
player[team == 'GSW' & position == "C"]
## [1] "Damian Jones" "David West" "JaVale McGee" "Kevon Looney"
## [5] "Zaza Pachulia"
# players of both GSW and LAL
player[team == "GSW" | team == "LAL"]
## [1] "Andre Iguodala" "Damian Jones" "David West"
## [4] "Draymond Green" "Ian Clark" "James Michael McAdoo"
## [7] "JaVale McGee" "Kevin Durant" "Kevon Looney"
## [10] "Klay Thompson" "Matt Barnes" "Patrick McCaw"
## [13] "Shaun Livingston" "Stephen Curry" "Zaza Pachulia"
## [16] "Brandon Ingram" "Corey Brewer" "D'Angelo Russell"
## [19] "David Nwaba" "Ivica Zubac" "Jordan Clarkson"
## [22] "Julius Randle" "Larry Nance Jr." "Luol Deng"
## [25] "Metta World Peace" "Nick Young" "Tarik Black"
## [28] "Thomas Robinson" "Timofey Mozgov" "Tyler Ennis"
# Shooting Guards and Point Guards of Lakers (LAL)
player[team == 'LAL' & (position == "SG" | position == "PG")]
## [1] "D'Angelo Russell" "David Nwaba" "Jordan Clarkson"
## [4] "Nick Young" "Tyler Ennis"
# Small Forwards of GSW and LAL
player[position == "SF" & (team == "GSW" | team == "LAL")]
## [1] "Andre Iguodala" "Kevin Durant" "Matt Barnes"
## [4] "Brandon Ingram" "Corey Brewer" "Luol Deng"
## [7] "Metta World Peace"
# name of the player with largest salary
player[salary == max(salary)]
## [1] "LeBron James"
player[which.max(salary)]
## [1] "LeBron James"
# name of the player with smallest salary
player[salary == min(salary)]
## [1] "Edy Tavares"
player[which.min(salary)]
## [1] "Edy Tavares"
# name of the player with largest number of points
player[points == max(points)]
## [1] "Russell Westbrook"
player[which.max(points)]
## [1] "Russell Westbrook"
# salary of the player with largest number of points
salary[points == max(points)]
## [1] 26540100
salary[which.max(points)]
## [1] 26540100
# largest salary of all Centers
max(salary[position == "C"])
## [1] 26540100
# team of the player with the largest number of points
team[points == max(points)]
## [1] OKC
## 30 Levels: ATL BOS BRK CHI CHO CLE DAL DEN DET GSW HOU IND LAC LAL ... WAS
team[which.max(points)]
## [1] OKC
## 30 Levels: ATL BOS BRK CHI CHO CLE DAL DEN DET GSW HOU IND LAC LAL ... WAS
# name of the player with the largest number of 3-pointers
player[points3 == max(points3)]
## [1] "Stephen Curry"
player[which.max(points3)]
## [1] "Stephen Curry"
warriors_players <- player[team == "GSW"]
warriors_salary <- salary[team == "GSW"]
warriors_points <- points[team == "GSW"]
# scatterplot original scale
plot(warriors_points, warriors_salary)
text(warriors_points, warriors_salary, labels = warriors_players)
# scatterplot with log-scale
plot(log(warriors_points), log(warriors_salary))
text(log(warriors_points), log(warriors_salary), labels = warriors_players)
# positions of Warriors
position_fac[team == 'GSW']
## [1] SF C C PF SG PF C SF C SG SF SG PG PG C
## Levels: C PF PG SF SG
# positions of players with salaries > 15 millions
position_fac[salary > 20000000]
## [1] C PF SF SG SG C PF SG C C SG SF PG C SF PF PG PF PG C PG SF C
## [24] PG PG C PF PF
## Levels: C PF PG SF SG
# frequencies (counts) of positions with salaries > 15 millions
table(position_fac[salary > 15000000])
##
## C PF PG SF SG
## 17 9 9 13 8
# relative frequencies (proportions) of 'SG' (Shooting Guards) in each team
prop.table(table(team[position_fac == 'SG']))
##
## ATL BOS BRK CHI CHO CLE
## 0.01052632 0.03157895 0.04210526 0.04210526 0.04210526 0.03157895
## DAL DEN DET GSW HOU IND
## 0.02105263 0.04210526 0.03157895 0.03157895 0.03157895 0.02105263
## LAC LAL MEM MIA MIL MIN
## 0.03157895 0.03157895 0.03157895 0.04210526 0.04210526 0.02105263
## NOP NYK OKC ORL PHI PHO
## 0.02105263 0.04210526 0.02105263 0.04210526 0.02105263 0.04210526
## POR SAC SAS TOR UTA WAS
## 0.04210526 0.05263158 0.05263158 0.03157895 0.02105263 0.03157895