Gaston Sanchez
- Work with vectors of different data types
- Understand the concept of atomic structures
- Learn how to subset and slice R vectors
- Understand the concept of vectorization
- Understand recycling rules in R
- Write your descriptions, explanations, and code in an Rmd (R markdown) file.
- Name this file as lab02-first-last.Rmd, where first and last are your first and last names (e.g. lab02-gaston-sanchez.Rmd).
- Knit your Rmd file as an html document (default option).
- Submit your Rmd and html files to bCourses, in the corresponding lab assignment.
- Due date displayed in the syllabus (see github repo).
In this lab, you are going to work with a handful of variables about NBA players from the regular season 2017-2018:
- player: name of player.
- team: team name abbreviation.
- position: player position.
- age: age of player.
- experience: years of experience in NBA.
- salary: salary (in dollars).
- scored: total scored points.
- points1: number of free throws, worth 1 point each.
- points2: number of 2-point field goals, worth 2 points each.
- points3: number of 3-point field goals, worth 3 points each.
The data is in the file nba2018-salary-points.RData, located in the
github repository https://github.com/ucb-stat133/stat133-labs. The
original source of the data is the website www.basketball-reference.com
Open a new session in RStudio, and make sure you have a clean workspace by typing this command on the console:
# remove existing objects
rm(list = ls())
You can download the .RData file to your working directory, and then
load() it with the code below. Do NOT include these commands in your
source Rmd file; simply type them directly on the console:
# download RData file into your working directory
rdata <- "https://github.com/ucb-stat133/stat133-labs/raw/master/data/nba2018-salary-points.RData"
download.file(url = rdata, destfile = 'nba2018-salary-points.RData')
The function download.file() allows you to download any type of file
from the Web. In this case you are downloading the file called
nba2018-salary-points.RData, which is located in the github repository
of the course. This file is a binary file. To be more precise, the file
extension .RData is the default extension used by R for its native
binary format.
Where does the file get downloaded? By default, the file
nba2018-salary-points.RData gets downloaded to your working
directory. If you are curious about which directory R is currently
using, simply call the function getwd(), which stands for "get the
working directory".
If you want to choose a different location for the downloaded file, then
modify the destfile argument. For instance, if you are using Mac, and
you want the file to be downloaded to your desktop, you can use:
# download RData file to your Desktop (assuming you use Mac)
rdata <- "https://github.com/ucb-stat133/stat133-labs/raw/master/data/nba2018-salary-points.RData"
download.file(url = rdata, destfile = '~/Desktop/nba2018-salary-points.RData')
To load or import the contents of the binary file into your R session
you use load(). This function allows you to import R binary files.
This time, include the code below in your Rmd file:
# load data in your R session
load('nba2018-salary-points.RData')
Note: the code above will only work as long as your Rmd file lives in
the same directory as the .RData file. So far I’m assuming that you
have both files in your working directory.
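To see what load() does without touching the lab data, here is a minimal sketch. The objects demo_x and demo_y and the temporary file are made up for this illustration. The point is that load() recreates the saved objects under their original names instead of returning their values.

```r
# hypothetical objects, used only for this illustration
demo_x <- c(10, 20, 30)
demo_y <- c("a", "b", "c")

# write both objects to a temporary binary .RData file
tmp_file <- tempfile(fileext = ".RData")
save(demo_x, demo_y, file = tmp_file)

# remove them from the workspace, then bring them back with load()
rm(demo_x, demo_y)
restored <- load(tmp_file)

# load() invisibly returns the names of the objects it restored
restored
exists("demo_x")
```

Because the objects come back under their saved names, calling ls() right after load() is a quick way to check what a file contained.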
Once you have imported (or loaded) the data, use the function ls(),
which allows you to list all the available R objects:
# list the available objects with ls()
ls()
Create a vector four by selecting the first four elements in player:
four <- head(player, n = 4)
Single brackets [ ] are used to subset (i.e. subscript, split)
vectors. Find out what happens if you specify:
- number one: four[1]
- an index of zero: four[0]
- a negative index: four[-1]
- various negative indices: four[-c(1,2,3)]
- an index greater than the length of the vector: four[5]
- repeated indices: four[c(1,2,2,3,3,3)]
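If you want to compare your findings, the same kinds of indices can be tried on a small toy vector. The vector v below is hypothetical and is not part of the lab data; the comments describe base R's behavior for each case.

```r
# hypothetical toy vector (not the lab data)
v <- c("a", "b", "c", "d")

v[1]                    # positive index: the first element
v[0]                    # zero index: an empty character vector
v[-1]                   # negative index: everything except the first element
v[-c(1, 2, 3)]          # several negative indices: drops elements 1, 2 and 3
v[5]                    # index beyond the length: NA
v[c(1, 2, 2, 3, 3, 3)]  # repeated indices: elements are repeated
```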
Often, you will need to generate vectors of numeric sequences, like the
first five elements 1:5, or from the first till the last element
1:length(player). R provides the colon operator :, and the functions
seq() and rep(), to create various types of sequences.
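As a quick illustration of these three tools, here is a small sketch with made-up values. The object names are hypothetical and are used only so each result can be inspected.

```r
# colon operator: consecutive integers
first_five <- 1:5
first_five        # 1 2 3 4 5

# seq(): control the step with 'by', or the number of values with 'length.out'
evens <- seq(from = 2, to = 10, by = 2)
evens             # 2 4 6 8 10
spaced <- seq(from = 1, to = 9, length.out = 5)
spaced            # 1 3 5 7 9

# rep(): repeat the whole vector ('times') or each element ('each')
ab_times <- rep(c("a", "b"), times = 3)
ab_times          # "a" "b" "a" "b" "a" "b"
ab_each <- rep(c("a", "b"), each = 2)
ab_each           # "a" "a" "b" "b"
```

Any of these numeric sequences can be placed inside brackets to pick out the elements at those positions.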
Figure out how to use seq(), rep(), and bracket notation, to
extract:
- all the even elements in player
- all the odd elements in salary
- all multiples of 5 (e.g. 5, 10, 15, etc) of team
- elements in positions 10, 20, 30, 40, etc of scored
- all the even elements in team but this time in reverse order
Another kind of subsetting/subscripting is the so-called logical subsetting. This kind of subsetting typically takes place when making comparisons. A comparison operation occurs when you use comparison operators such as:
- > greater than
- >= greater than or equal
- < less than
- <= less than or equal
- == equal
- != different
For example:
scored_four <- scored[1:4]
# elements greater than 100
scored_four[scored_four > 100]
# elements less than 100
scored_four[scored_four < 100]
# elements less than or equal to 10
scored_four[scored_four <= 10]
# elements different from 10
scored_four[scored_four != 10]
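It can help to look at the intermediate step: the comparison on its own returns a logical vector, and the brackets keep the positions where that vector is TRUE. A minimal sketch with hypothetical values (pts is not part of the lab data):

```r
# hypothetical scores, not the lab data
pts <- c(5, 120, 80, 250)

# the comparison itself returns one logical value per element
big <- pts > 100
big         # FALSE TRUE FALSE TRUE

# inside brackets, only the TRUE positions are kept
pts[big]    # 120 250

# sum() counts the TRUE values, because TRUE is treated as 1
sum(big)    # 2
```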
In addition to using comparison operators, you can also use logical operators to produce a logical vector. The most common logical operators are:
- & AND
- | OR
- ! negation
Run the following commands to see what R does:
# AND
TRUE & TRUE
TRUE & FALSE
FALSE & FALSE
# OR
TRUE | TRUE
TRUE | FALSE
FALSE | FALSE
# NOT
!TRUE
!FALSE
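The same operators also work element by element on whole logical vectors, which is what makes them useful for subsetting. A small sketch with two made-up vectors a and b:

```r
# two hypothetical logical vectors of the same length
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b   # TRUE FALSE FALSE FALSE
a | b   # TRUE TRUE TRUE FALSE
!a      # FALSE FALSE TRUE TRUE
```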
Logical operators allow you to combine several comparisons:
# players of Golden State (GSW)
player[team == 'GSW']
# name of players with salaries greater than 20 million dollars
player[salary > 20000000]
# name of players with scored points between 1000 and 1200 (exclusive)
player[scored > 1000 & scored < 1200]
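To check how combined conditions behave, it may be easier to start with a few hand-made parallel vectors where the answer is obvious. The vectors nm, tm and pts below are hypothetical stand-ins for player, team and scored.

```r
# hypothetical parallel vectors (one entry per made-up player)
nm  <- c("Ann", "Bob", "Cy", "Dee")
tm  <- c("AAA", "BBB", "AAA", "BBB")
pts <- c(900, 1100, 1150, 1300)

# both conditions must hold (AND)
nm[pts > 1000 & pts < 1200]    # "Bob" "Cy"
nm[tm == "AAA" & pts > 1000]   # "Cy"

# at least one condition must hold (OR)
nm[tm == "AAA" | pts > 1200]   # "Ann" "Cy" "Dee"
```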
Write commands, using bracket notation, to answer the following
questions (you may need to use min(), max(), which(),
which.min(), which.max()):
- players in position Center, of Warriors (GSW)
- players of both GSW (warriors) and LAL (lakers)
- players in positions Shooting Guard and Point Guards, of Lakers (LAL)
- subset Small Forwards of GSW and LAL
- name of the player with largest salary
- name of the player with smallest salary
- name of the player with largest number of scored points
- salary of the player with largest number of points
- largest salary of all Centers
- team of the player with the largest number of scored points
- name of the player with the largest number of 3-pointers
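As a hint for the questions above, which.max() and which.min() return a position, and that position can then be used to index a different vector of the same length. A minimal sketch on hypothetical vectors nm and pay (not the lab data):

```r
# hypothetical parallel vectors
nm  <- c("Ann", "Bob", "Cy", "Dee")
pay <- c(50, 200, 75, 120)

which.max(pay)          # position of the largest value: 2
nm[which.max(pay)]      # name at that position: "Bob"
nm[which.min(pay)]      # name at the position of the smallest value: "Ann"
which(pay > 100)        # positions where the condition is TRUE: 2 4
max(pay[nm != "Bob"])   # largest value within a subset: 120
```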
Use the function plot() to make a scatterplot of scored and salary:
plot(scored, salary)
Keep in mind that plot() is a generic function. This means that the
behavior of plot() depends on the type of input. When you pass two
numeric vectors, plot() will attempt to create a scatter plot.
The function plot() produces static plots. But you can also try to get
interactive plots. One option to do this is via the R package
"plotly". To use this package, you will have to install "ggplot2"
and "plotly". Remember to use the install.packages() function but do
NOT include it in your Rmd:
install.packages(c("ggplot2", "plotly"))
What you DO need to include is the library() function to load
"plotly":
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
The main function to graph data with "plotly" is the plot_ly()
function. When your data is in vectors, you can graph a scatterplot like
this:
plot_ly(x = scored, y = salary, type = "scatter", mode = "markers")
By the way, the output of plot_ly() will only work when you knit an
html file. If you try to knit using a different format, then plot_ly()
won’t work.
Looking at the generated plot, can you see any issues?
To get a better display of the scatterplot, let’s create two vectors
log_scored and log_salary by transforming scored and salary with
the logarithm function log():
log_scored <- log(scored)
log_salary <- log(salary)
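A short sketch of how log() behaves may be useful here. It is applied to every element of a vector, it computes the natural logarithm by default, and it returns -Inf for zero. Whether the lab data actually contains zeros is not something this sketch assumes; the vector x below is made up.

```r
# hypothetical values, not the lab data
x <- c(1, 10, 100, 1000)

log(x)     # natural logarithm of each element
log10(x)   # base-10 logarithm: 0 1 2 3

# the logarithm of zero is -Inf, which cannot be placed on a finite axis
log(0)
is.finite(log(c(0, 5)))   # FALSE TRUE
```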
Make another scatterplot but now use the log-transformed vectors:
plot(log_scored, log_salary)
To add the names of the players in the plot, you can use the low-level
graphing function text():
plot(log_scored, log_salary)
text(log_scored, log_salary, labels = player)
Now we have another problem. The labels in the plot are very messy. A
quick and dirty fix is to use abbreviate() to shorten the displayed
names:
plot(log_scored, log_salary)
text(log_scored, log_salary, labels = abbreviate(player))
Your Turn: create a scatterplot of scored points and salary for the Warriors (GSW), displaying the names of the players. Generate two scatterplots: one with raw values (original scale), and another with log-transformed values.
As mentioned before, vectors are the most essential type of data structure in R. They are atomic structures, meaning that a vector can contain only one type of data: integers, real numbers, logical values, characters, or complex numbers.
Related to vectors, there is another important data structure in R called factor. Factors are data structures exclusively designed to handle categorical data.
The object team is an R factor. You can confirm this by using
is.factor() or class():
is.factor(team)
## [1] TRUE
Use factor() to create an object position_fac by converting
position into a factor:
position_fac <- factor(position)
If you have a factor, you can invoke table() to get a table with the
frequencies (i.e. counts) of the factor categories or levels:
table(position_fac)
## position_fac
## C PF PG SF SG
## 97 98 96 84 102
Because factors are internally stored as integers, you can manipulate factors as any other vector:
position_fac[1:5]
## [1] C PF SG PG SF
## Levels: C PF PG SF SG
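To make the "stored as integers" idea concrete, here is a minimal sketch on a small hypothetical factor pos. The levels are the distinct categories (sorted alphabetically by default), and each element is stored as the integer position of its level.

```r
# hypothetical factor, not the lab data
pos <- factor(c("C", "PF", "SG", "C", "PG"))

levels(pos)       # "C" "PF" "PG" "SG"
as.integer(pos)   # 1 2 4 1 3

# subsetting works as with other vectors; the result keeps all the levels
centers <- pos[pos == "C"]
centers
levels(centers)   # still "C" "PF" "PG" "SG"
```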
Practice manipulating position_fac to get:
- positions of Warriors
- positions of players with salaries > 15 million
- frequencies (counts) of positions with salaries > 15 million
- relative frequencies (proportions) of ‘SG’ (Shooting Guards) in each team
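One possible pattern for the counting questions, sketched on made-up data: subset the factor with a logical condition, pass the result to table(), and use prop.table() to turn counts into proportions. The objects pos and pay are hypothetical, and pay is expressed in millions only to keep the numbers short.

```r
# hypothetical positions and salaries (in millions), not the lab data
pos <- factor(c("C", "SG", "SG", "PF", "C", "SG"))
pay <- c(20, 5, 18, 3, 9, 16)

# counts per level
counts_all <- table(pos)
counts_all                  # C: 2, PF: 1, SG: 3

# counts per level among the elements that satisfy a condition
counts_high <- table(pos[pay > 15])
counts_high                 # C: 1, PF: 0, SG: 2

# proportions instead of counts
props_all <- prop.table(counts_all)
props_all
```

Note that levels with no matching elements still appear in the table, with a count of zero.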
Let’s go back to the scatterplot of scored and salary:
plot(scored, salary)
- Use your factor position_fac to add some color to the dots in the scatterplot.
- Pass the factor to the col = parameter inside plot().
- Experiment with other plot() arguments like the point character pch =, the size of dots with the parameter cex =, the axes labels xlab and ylab, and so on.
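A rough sketch of what happens when a factor is given to col: as far as base graphics is concerned, the factor's integer codes are used as color numbers, so each level gets its own color from the current palette. The vectors x, y and pos below are hypothetical, and the argument values are only examples to experiment with.

```r
# hypothetical data, not the lab data
x   <- c(1, 2, 3, 4)
y   <- c(4, 3, 2, 1)
pos <- factor(c("C", "PF", "SG", "C"))

# the integer codes behind the factor: one color number per point
codes <- as.integer(pos)
codes   # 1 2 3 1

# scatterplot colored by level, with solid dots (pch = 19) drawn larger (cex = 1.5)
plot(x, y, col = pos, pch = 19, cex = 1.5,
     xlab = "x values", ylab = "y values")
```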