Lab 2: Vectors and other data structures

Gaston Sanchez

Learning Objectives

Work with vectors of different data types

Understand the concept of atomic structures

Learn how to subset and slice R vectors

Understand the concept of vectorization

Understand recycling rules in R

Getting the Data File

In this lab, you are going to work with a handful of variables about NBA players from the regular season 2016-2017:

player: name of the player.
team: team name abbreviation.
position: player position.
salary: salary (in dollars).
points: total scored points.
points1: number of free throws, worth 1 point each.
points2: number of 2-point field goals, worth 2 points each.
points3: number of 3-point field goals, worth 3 points each.

The data is in the file nba2017-salary-points.RData, located in the data/ folder of this github repository. The original source of the data is the website www.basketball-reference.com

Open a new session in Rstudio, and make sure you have a clean workspace by typing this command on the console:

# remove existing objects
rm(list = ls())

You can download the .RData file to your working directory, and then load() it with the code below. Do NOT include these commands in your source Rmd file; simply type them directly on the console:

# download RData file into your working directory
rdata <- "https://github.com/ucb-stat133/stat133-fall-2017/raw/master/data/nba2017-salary-points.RData"
download.file(url = rdata, destfile = 'nba2017-salary-points.RData')

What's happening in the code above?

The function download.file() allows you to download any type of file from the Web. In this case you are downloading the file called nba2017-salary-points.RData which is located in the github repository of the course. This file is a binary file. To be more precise, the file extension .RData is the default extension used by R for its binary native format.

Where does the file get downloaded? Bby default, the file nba2017-salary-points.RData gets downloaded to your working directory. If you are curious about what is the current directory to which R is paying attention to, simply type the function getwd()---which stands for get the working directory.

If you want to specify a specific location for the downloaded file, then modify the destfile parameter. For instance, if you are using Mac, and you want the file to be downloaded to your desktop, you can use:

# download RData file to your Desktop (assuming you use Mac)
rdata <- "https://github.com/ucb-stat133/stat133-fall-2017/raw/master/data/nba2017-salary-points.RData"
download.file(url = rdata, destfile = '~/Desktop/nba2017-salary-points.RData')

Loading the data file

To load or import the contents of the binary file into your R session you use load(). This function allows you to import R binary files. This time, include the code below in your Rmd file:

# load data in your R session
load('nba2017-salary-points.RData')

Note: the code above will only work as long as your Rmd file lives in the same directory of the .RData file. So far I'm assuming that you have both files in the working directory. If you run into problems, ask the GSI or any of the lab assistants.

Once you imported (or loaded) the data, use the function ls() which allows you to list all the available R objects:

# list the available objects with ls()
ls()

Check that the following objects are available in your session:

player
team
position
salary
points

About `.RData` files

Most of the data sets you are going to be working with in real life are going to be stored in some sort of text file. So you probably won't be handling many R binary files (i.e. .RData files). However, R binary files are an interesting option for saving intermediate results, and/or for saving R objects (as R objects).

Inspecting the data objects

Once you have some data objects to work with, the first step is to inspect some of their characteristics. R has various functions that allow you to examine objects:

typeof() type of storage of any object
class() gives you the class of the object
str() displays the structure of an object in a compact way
mode() gives the data type (as used in R)
object.size() gives an estimate of the memory space used by an object
length() gives the length (i.e. number of elements)
head() take a peek at the first elements
tail() take a peek at the last elements
summary() shows a summary of a given object

Your turn. Find out what is the class of each of the objects player, team, etc:

# your code here

Now use length(), head(), tail(), and summary() to start exploring the content of the objects:

# your code here

Do all the objects have the same length?

Vectors in R

Vectors are the most basic type of data structures in R. Learning how to manipulate data structures in R requires you to start learning how to manipulate vectors.

How do you know if any of the loaded objects is a vector? Using class() does not explicitly tell you whether an R object is a vector or not. A more explicit way to check if player or position are vectors is with the function is.vector().

# check whether the loaded objects are vectors

Remember that R vectors are atomic structures, which is just the fancy name to indicate that all the elements of a vector must be of the same type, either all numbers, all characters, or all logical values. To test if an object is atomic, i.e. has all its elements of the same type, use is.atomic()

# check whether the loaded objects are atomic objects

How do you know that a given vector is of a certain data type? One function to answer this question is typeof()

typeof(player)
typeof(position)
typeof(points)

You should know that among the R community, most useRs don't talk about types. Instead, because of historical reasons related to the S language, we talk about modes.

Manipulating Vectors: Subsetting

Subsetting refers to extracting elements of a vector (or another R object). To do so, you use what is known as bracket notation. This implies using (square) brackets [ ] to get access to the elements of a vector, for instance:

# first element of 'player'
player[1]

# first five elements of 'player'
player[1:5]

What type of things can you specify inside the brackets? Basically:

numeric vectors
logical vectors (the length of the logical vector must match the length of the vector to be subset)
character vectors (if the elements have names)

Subsetting with Numeric Indices

Here are some subsetting examples using a numeric vector inside the brackets:

# fifth element of 'player'
player[5]

# numeric range
player[2:8]

# numeric vector
player[c(1, 3, 5, 7)]

# different order
player[c(20, 9, 10, 50)]

# third element (four times)
player[rep(3, 4)]

Subset the first four players:

four <- player[1:4]

Now try these. What happens if you specify:

an index of zero: four[0]?
a negative index: four[-1]?
various negative indices: four[-c(1,2,3)]?
an index greater than the length of the vector: four[5]?
repeated indices: four[c(1,2,2,3,3,3)]?

Often, you will need to generate vectors of numeric sequences, like the first five elements 1:5, or from the first till the last element 1:length(player). R provides the colon operator :, and the functions seq(), and rep() to create various types of sequences.

Figure out how to use seq(), rep() to extract:

all the even elements in player
all the odd elements in salary
all multiples of 5 (e.g. 5, 10, 15, etc) of team
elements in positions 10, 20, 30, 40, etc of points
all the even elements in team but this time in reverse order

Subsetting with Logical Indices

Logical subsetting involves using a logical vector inside the brackets. Learning about logical subsetting is a fundamental survival skill. This kind of subsetting is very powerful because it allows you to extract elements based on some logical condition.

Here's a toy example of logical subsetting:

# dummy vector
a <- c(5, 6, 7, 8)

# logical subsetting
a[c(TRUE, FALSE, TRUE, FALSE)]

## [1] 5 7

Logical subsetting occurs when the vector of indices that you pass inside the brackets is a logical vector.

To do logical subsetting, the vector that you put inside the brackets, should match the length of the manipulated vector. If you pass a shorter vector inside brackets, R will apply its recycling rules.

Notice that the elements of the vector that are subset are those which match the logical value TRUE.

# your turn
four[c(TRUE, TRUE, TRUE, TRUE)]
four[c(TRUE, TRUE, FALSE, FALSE)]
four[c(FALSE, FALSE, TRUE, TRUE)]
four[c(TRUE, FALSE, TRUE, FALSE)]
four[c(FALSE, FALSE, FALSE, FALSE)]

# recycling
four[TRUE]
four[c(TRUE, FALSE)]

When subsetting a vector logically, most of the times you won't really be providing an explicit vector of TRUE's and FALSEs. Just imagine having a vector of 100 or 1000 or 1000000 elements, and trying to do logical subsetting by manually creating a logical vector of the same length. That would be very boring. Instead, you will be providing a logical condition or a comparison operation that returns a logical vector.

A comparison operation occurs when you use comparison operators such as:

> greater than
>= greater than or equal
< less than
<= less than or equal
== equal
!= different

# elements greater than 6
a > 6

## [1] FALSE FALSE  TRUE  TRUE

# elements different from 6
a != 6

## [1]  TRUE FALSE  TRUE  TRUE

Notice that a comparison operation always returns a logical vector. Here's how you actually extract the values of a vector based on logical indexing:

points_four <- points[1:4]

# elements greater than 100
points_four[points_four > 100]

## [1] 952 520 894

# elements less than 100
points_four[points_four < 100]

## [1] 10

# elements less than or equal to 10
points_four[points_four <= 10]

## [1] 10

# elements different from 10
points_four[points_four != 10]

## [1] 952 520 894

In addition to using comparison operators, you can also use logical operators to produce a logical vector. The most common type of logical operators are:

& AND
| OR
! negation

Run the following commands to see what R does:

# AND
TRUE & TRUE
TRUE & FALSE
FALSE & FALSE

# OR
TRUE | TRUE
TRUE | FALSE
FALSE | FALSE

# NOT
!TRUE
!FALSE

Logical operators allow you to combine several comparisons:

# players of Golden State (GSW)
player[team == 'GSW']

# name of players with salaries greater than 20 million dollars
player[salary > 20000000]

# name of players with scored points between 1000 and 1200 (exclusive)
player[points > 1000 & points < 1200]

Your turn

Write commands, using bracket notation, to answer the following questions (you may need to use min(), max(), which(), which.min(), which.max()):

# players in position Center, of Warriors (GSW)


# players of both GSW (warriors) and LAL (lakers)


# players in positions Shooting Guard and Point Guards, of Lakers (LAL) 


# subset Small Forwards of GSW and LAL


# name of the player with largest salary


# name of the player with smallest salary


# name of the player with largest number of points


# salary of the player with largest number of points


# largest salary of all Centers


# team of the player with the largest number of points


# name of the player with the largest number of 3-pointers

Subsetting with Character Vectors

A third type of subsetting involves passing a character vector inside brackets. When you do this, the characters are supposed to be names of the manipulated vector.

None of the vectors player, position, points, and salary have names. You can confirm that with the names() function applied on any of the vectors:

names(player)

Create a vector warriors_player by selecting the players of team == 'GSW'; then create a vector warriors_salary with their salaries, and another vector warriors_points with their points. Assign warriors_player as the names of warriors_salary.

# create warriors_player


# create warriors_salary


# create warriors_points

You should have a vector warriors_salary with named elements. Now you can use character subsetting:

warriors_salary["Andre Iguodala"]

warriors_salary[c("Stephen Curry", "Kevin Durant")]

Some plotting

Use the function plot() to make a scatterplot of points and salary

plot(points, salary)

Keep in mind that plot() is a generic function. This means that the behavior of plot() depends on the type of input. When you pass two numeric vectors, plot() will attempt to create a scatter plot.

Looking at the generated plot, can you see any issues?

To get a better display of the scatterplot, let's create two vectors log_points and log_salary by transforming points and salary with the logarithm function log()

log_points <- log(points)
log_salary <- log(salary)

Make another scatterplot but now use the log-transformed vectors:

plot(log_points, log_salary)

To add the names of the players in the plot, you can use the low-level graphing function text():

plot(log_points, log_salary)
text(log_points, log_salary, labels = player)

Now we have another problem. The labels in the plot are very messy. A quick and dirty fix is to use abbreviate() to shorten the displayed names:

plot(log_points, log_salary)
text(log_points, log_salary, labels = abbreviate(player))

Your Turn: create a scatterplot of points and salary for the Warriors (GSW), displaying the names of the players. Generate two scatterplots, one with raw values (original scale, and another plot with log-transformations).

Vectorization

When you create the vectors log_points <- log(points) and log_salary <- log(salary), what you're doing is applying a function to a vector, which in turn acts on all elements of the vector.

This is called Vectorization in R parlance. Most functions that operate with vectors in R are vectorized functions. This means that an action is applied to all elements of the vector without the need to explicitly type commands to traverse all the elements.

In many other programming languages, you would have to use a set of commands to loop over each element of a vector (or list of numbers) to transform them. But not in R.

Another example of vectorization would be the calculation of the square root of all the elements in points and salary:

sqrt(points)
sqrt(salary)

Why should you care about vectorization?

If you are new to programming, learning about R's vectorization will be very natural (you won't stop to think about it too much). If you have some previous programming experience in other languages (e.g. C, python, perl), you know that vectorization does not tend to be a native thing.

Vectorization is essential in R. It saves you from typing many lines of code, and you will exploit vectorization with other useful functions known as the apply family functions (we'll talk about them later in the course).

Recycling

Closely related with the concept of vectorization we have the notion of Recycling. To explain recycling let's see an example.

salary is given in dollars, but what if you need to obtain the salaries in millions of dollars?. Create a new vector salary_millions with the converted values in millions of dollars.

# your code here

Take the values in height, given in inches, to create a new vector height_cms that shows the values in centimeters; use the conversion of 1 inch = 2.54 centimeters.

# your code here

What you just did (assuming that you did things correctly) is called Recycling. To understand this concept, you need to remember that R does not have a data structure for scalars (single numbers). Scalars are in reality vectors of length 1.

Converting square kms to square miles requires this operation: area * 0.836. Although it may not be obvious, we are multiplying two vectors: area and 0.836. Moreover (and more important) we are multiplying two vectors of different lengths!. So how does R know what to do in this cases?

Well, R uses the recycling rule, which takes the shorter vector (in this case 0.386) and recycles its elements to form a temporary vector that matches the length of the longer vector (i.e. area).

Another recycling example

Here's another example of recycling. Heights of elements in an odd number position will be transformed to cms; heights of elements in an even number position will be transformed to meters:

units <- c(2.54, 0.0254)
mix_height <- height * units

The elements of units are recycled and repeated as many times as elements in height. The previous command is equivalent to this:

new_units <- rep(c(2.54, 0.0254), length.out = length(height))
height * new_units

Factors

As mentioned before, vectors are the most essential type of data structure in R. They are atomic structures (can contain only one type of data): integers, real numbers, logical values, characters, complex numbers.

Related to vectors, there is another important data structure in R called factor. Factors are data structures exclusively designed to handle categorical data.

The object team is an R factor. You can confirm this by using is.factor() or class()

is.factor(team)

## [1] TRUE

Creating Factors

Use factor() to create an object position_fac by converting position into a factor:

position_fac <- factor(position)

If you have a factor, you can invoke table() to get a table with the frequencies (i.e. counts) of the factor categories or levels:

table(position_fac)

## position_fac
##  C PF PG SF SG 
## 89 89 85 83 95

Manipulating Factors

Because factors are internally stored as integers, you can manipulate factors as any other vector:

position_fac[1:5]

## [1] C  PF SG PG SF
## Levels: C PF PG SF SG

Practice manipulating position_fac to get:

# positions of Warriors


# positions of players with salaries > 15 millions


# frequencies (counts) of positions with salaries > 15 millions


# relative frequencies (proportions) of 'SG' (Shooting Guards) in each team


#

More Plots

Let's go back to the scatterplot of points and salary

plot(points, salary)

But now use your factor position_fac to add some color to the dots in the scatterplot. Pass the factor to the col = parameter inside plot()

# your colored scatterplot

Experiment with other plot() arguments like the point character pch =, the size of dots with the parameter cex =, the axes labels xlab and ylab and so on.

Solutions

# class of objects
class(player)

## [1] "character"

class(team)

## [1] "factor"

class(position)

## [1] "character"

class(salary)

## [1] "numeric"

class(points)

## [1] "integer"

# check whether the loaded objects are vectors
is.vector(player)

## [1] TRUE

is.vector(team)

## [1] FALSE

is.vector(position)

## [1] TRUE

is.vector(salary)

## [1] TRUE

# all the even elements in `player`
player_even <- player[seq(from = 1, to = length(player), by = 2)]

# all the odd elements in `salary`
salary_odd <- salary[seq(from = 2, to = length(player), by = 2)]

# all multiples of 5 (e.g. 5, 10, 15, etc) of `team`
teams_by_5 <- team[seq(from = 5, to = length(player), by = 5)]

# elements in positions 10, 20, 30, 40, etc of `points`
points_by_10 <- points[seq(from = 10, to = length(player), by = 10)]

# centers of Warriors
player[team == 'GSW' & position == "C"]

## [1] "Damian Jones"  "David West"    "JaVale McGee"  "Kevon Looney" 
## [5] "Zaza Pachulia"

# players of both GSW and LAL
player[team == "GSW" | team == "LAL"]

##  [1] "Andre Iguodala"       "Damian Jones"         "David West"          
##  [4] "Draymond Green"       "Ian Clark"            "James Michael McAdoo"
##  [7] "JaVale McGee"         "Kevin Durant"         "Kevon Looney"        
## [10] "Klay Thompson"        "Matt Barnes"          "Patrick McCaw"       
## [13] "Shaun Livingston"     "Stephen Curry"        "Zaza Pachulia"       
## [16] "Brandon Ingram"       "Corey Brewer"         "D'Angelo Russell"    
## [19] "David Nwaba"          "Ivica Zubac"          "Jordan Clarkson"     
## [22] "Julius Randle"        "Larry Nance Jr."      "Luol Deng"           
## [25] "Metta World Peace"    "Nick Young"           "Tarik Black"         
## [28] "Thomas Robinson"      "Timofey Mozgov"       "Tyler Ennis"

# Shooting Guards and Point Guards of Lakers (LAL) 
player[team == 'LAL' & (position == "SG" | position == "PG")]

## [1] "D'Angelo Russell" "David Nwaba"      "Jordan Clarkson" 
## [4] "Nick Young"       "Tyler Ennis"

# Small Forwards of GSW and LAL
player[position == "SF" & (team == "GSW" | team == "LAL")]

## [1] "Andre Iguodala"    "Kevin Durant"      "Matt Barnes"      
## [4] "Brandon Ingram"    "Corey Brewer"      "Luol Deng"        
## [7] "Metta World Peace"

# name of the player with largest salary
player[salary == max(salary)]

## [1] "LeBron James"

player[which.max(salary)]

## [1] "LeBron James"

# name of the player with smallest salary
player[salary == min(salary)]

## [1] "Edy Tavares"

player[which.min(salary)]

## [1] "Edy Tavares"

# name of the player with largest number of points
player[points == max(points)]

## [1] "Russell Westbrook"

player[which.max(points)]

## [1] "Russell Westbrook"

# salary of the player with largest number of points
salary[points == max(points)]

## [1] 26540100

salary[which.max(points)]

## [1] 26540100

# largest salary of all Centers
max(salary[position == "C"])

## [1] 26540100

# team of the player with the largest number of points
team[points == max(points)]

## [1] OKC
## 30 Levels: ATL BOS BRK CHI CHO CLE DAL DEN DET GSW HOU IND LAC LAL ... WAS

team[which.max(points)]

## [1] OKC
## 30 Levels: ATL BOS BRK CHI CHO CLE DAL DEN DET GSW HOU IND LAC LAL ... WAS

# name of the player with the largest number of 3-pointers
player[points3 == max(points3)]

## [1] "Stephen Curry"

player[which.max(points3)]

## [1] "Stephen Curry"

warriors_players <- player[team == "GSW"]
warriors_salary <- salary[team == "GSW"]
warriors_points <- points[team == "GSW"]

# scatterplot original scale
plot(warriors_points, warriors_salary)
text(warriors_points, warriors_salary, labels = warriors_players)

# scatterplot with log-scale
plot(log(warriors_points), log(warriors_salary))
text(log(warriors_points), log(warriors_salary), labels = warriors_players)

# positions of Warriors
position_fac[team == 'GSW']

##  [1] SF C  C  PF SG PF C  SF C  SG SF SG PG PG C 
## Levels: C PF PG SF SG

# positions of players with salaries > 15 millions
position_fac[salary > 20000000]

##  [1] C  PF SF SG SG C  PF SG C  C  SG SF PG C  SF PF PG PF PG C  PG SF C 
## [24] PG PG C  PF PF
## Levels: C PF PG SF SG

# frequencies (counts) of positions with salaries > 15 millions
table(position_fac[salary > 15000000])

## 
##  C PF PG SF SG 
## 17  9  9 13  8

# relative frequencies (proportions) of 'SG' (Shooting Guards) in each team
prop.table(table(team[position_fac == 'SG']))

## 
##        ATL        BOS        BRK        CHI        CHO        CLE 
## 0.01052632 0.03157895 0.04210526 0.04210526 0.04210526 0.03157895 
##        DAL        DEN        DET        GSW        HOU        IND 
## 0.02105263 0.04210526 0.03157895 0.03157895 0.03157895 0.02105263 
##        LAC        LAL        MEM        MIA        MIL        MIN 
## 0.03157895 0.03157895 0.03157895 0.04210526 0.04210526 0.02105263 
##        NOP        NYK        OKC        ORL        PHI        PHO 
## 0.02105263 0.04210526 0.02105263 0.04210526 0.02105263 0.04210526 
##        POR        SAC        SAS        TOR        UTA        WAS 
## 0.04210526 0.05263158 0.05263158 0.03157895 0.02105263 0.03157895

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lab02-vector-basics.md

lab02-vector-basics.md

Lab 2: Vectors and other data structures

Learning Objectives

Getting the Data File

What's happening in the code above?

Loading the data file

About `.RData` files

Inspecting the data objects

Vectors in R

Manipulating Vectors: Subsetting

Subsetting with Numeric Indices

Subsetting with Logical Indices

Your turn

Subsetting with Character Vectors

Some plotting

Vectorization

Why should you care about vectorization?

Recycling

Another recycling example

Factors

Creating Factors

Manipulating Factors

More Plots

Solutions

Files

lab02-vector-basics.md

Latest commit

History

lab02-vector-basics.md

File metadata and controls

Lab 2: Vectors and other data structures

Learning Objectives

Getting the Data File

What's happening in the code above?

Loading the data file

About .RData files

Inspecting the data objects

Vectors in R

Manipulating Vectors: Subsetting

Subsetting with Numeric Indices

Subsetting with Logical Indices

Your turn

Subsetting with Character Vectors

Some plotting

Vectorization

Why should you care about vectorization?

Recycling

Another recycling example

Factors

Creating Factors

Manipulating Factors

More Plots

Solutions

About `.RData` files